* [PATCH 0/3] memcg: optimize charge codepath
@ 2022-08-22  0:17 ` Shakeel Butt
  0 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  0:17 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song
  Cc: Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang,
      Oliver Sang, Andrew Morton, lkp, cgroups, linux-mm, netdev,
      linux-kernel, Shakeel Butt

The Linux networking stack recently moved from very old per-socket
pre-charge caching to per-cpu caching, to avoid pre-charge fragmentation
and unwarranted OOMs. One impact of this change is that, for network
traffic workloads, the memcg charging codepath can become a bottleneck.
The kernel test robot has also reported this regression. This patch
series tries to improve memcg charging for such workloads.

The series implements three optimizations:
(A) Reduce atomic ops in the page counter update path.
(B) Change the layout of struct page_counter to eliminate false sharing
    between usage and high.
(C) Increase the memcg charge batch to 64.

To evaluate the impact of these optimizations, on a 72-CPU machine we
ran the following workload in the root memcg and then compared it with
the scenario where the workload runs in a three-level cgroup hierarchy
with the top level having min and low set up appropriately:

$ netserver -6
# 36 instances of netperf with the following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
1. root memcg		21694.8
2. 6.0-rc1		10482.7 (-51.6%)
3. 6.0-rc1 + (A)	14542.5 (-32.9%)
4. 6.0-rc1 + (B)	12413.7 (-42.7%)
5. 6.0-rc1 + (C)	17063.7 (-21.3%)
6. 6.0-rc1 + (A+B+C)	20120.3 (-7.2%)

With all three optimizations, the memcg overhead of this workload has
been reduced from 51.6% to just 7.2%.
Shakeel Butt (3):
  mm: page_counter: remove unneeded atomic ops for low/min
  mm: page_counter: rearrange struct page_counter fields
  memcg: increase MEMCG_CHARGE_BATCH to 64

 include/linux/memcontrol.h   |  7 ++++---
 include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
 mm/page_counter.c            | 13 ++++++-------
 3 files changed, 33 insertions(+), 21 deletions(-)

-- 
2.37.1.595.g718a3a8f04-goog

^ permalink raw reply	[flat|nested] 86+ messages in thread
* [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17 ` Shakeel Butt
@ 2022-08-22  0:17 ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  0:17 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song
  Cc: Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang,
      Oliver Sang, Andrew Morton, lkp, cgroups, linux-mm, netdev,
      linux-kernel, Shakeel Butt

For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
unconditionally. It only needs to do that operation if the new
protection value differs from the old one. This patch does that.

To evaluate the impact of this optimization, on a 72-CPU machine we ran
the following workload in a three-level cgroup hierarchy with the top
level having min and low set up appropriately: memory.min equal to the
size of the netperf binary, and memory.low double that.

$ netserver -6
# 36 instances of netperf with the following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)	10482.7 Mbps
With patch		14542.5 Mbps (38.7% improvement)

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
---
 mm/page_counter.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index eb156ff5d603..47711aa28161 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
 				      unsigned long usage)
 {
 	unsigned long protected, old_protected;
-	unsigned long low, min;
 	long delta;
 
 	if (!c->parent)
 		return;
 
-	min = READ_ONCE(c->min);
-	if (min || atomic_long_read(&c->min_usage)) {
-		protected = min(usage, min);
+	protected = min(usage, READ_ONCE(c->min));
+	old_protected = atomic_long_read(&c->min_usage);
+	if (protected != old_protected) {
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
 			atomic_long_add(delta, &c->parent->children_min_usage);
 	}
 
-	low = READ_ONCE(c->low);
-	if (low || atomic_long_read(&c->low_usage)) {
-		protected = min(usage, low);
+	protected = min(usage, READ_ONCE(c->low));
+	old_protected = atomic_long_read(&c->low_usage);
+	if (protected != old_protected) {
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
 			atomic_long_add(delta, &c->parent->children_low_usage);
-- 
2.37.1.595.g718a3a8f04-goog

^ permalink raw reply related	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17 ` Shakeel Butt
@ 2022-08-22  0:20 ` Soheil Hassas Yeganeh
  -1 siblings, 0 replies; 86+ messages in thread
From: Soheil Hassas Yeganeh @ 2022-08-22  0:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
      Michal Koutný, Eric Dumazet, Feng Tang, Oliver Sang,
      Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 8:17 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> [...]
>
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		14542.5 Mbps (38.7% improvement)
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Nice speed up!

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> [remainder of quoted patch snipped]

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17 ` Shakeel Butt
@ 2022-08-22  2:39 ` Feng Tang
  -1 siblings, 0 replies; 86+ messages in thread
From: Feng Tang @ 2022-08-22  2:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
      Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Oliver Sang,
      Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 08:17:35AM +0800, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> [...]
>
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		14542.5 Mbps (38.7% improvement)
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Reviewed-by: Feng Tang <feng.tang@intel.com>

Thanks!

- Feng

> [remainder of quoted patch snipped]

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17 ` Shakeel Butt
@ 2022-08-22  9:55 ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22  9:55 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
      Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
      Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.

This doesn't really explain why.

> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

I have a hard time grasping what the actual setup is, why it matters,
and why the patch makes any difference. Please elaborate some more
here.

> [...]
>
> -	min = READ_ONCE(c->min);
> -	if (min || atomic_long_read(&c->min_usage)) {
> -		protected = min(usage, min);
> +	protected = min(usage, READ_ONCE(c->min));
> +	old_protected = atomic_long_read(&c->min_usage);
> +	if (protected != old_protected) {

I have to cache that code back into my brain. It is a really subtle
thing and it is not obvious why this is still correct. I will think
about it some more, but the changelog could help with that a lot.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  9:55 ` Michal Hocko
  (?)
@ 2022-08-22 10:18 ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 10:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
[...]
> > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > index eb156ff5d603..47711aa28161 100644
> > --- a/mm/page_counter.c
> > +++ b/mm/page_counter.c
> > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> >  				      unsigned long usage)
> >  {
> >  	unsigned long protected, old_protected;
> > -	unsigned long low, min;
> >  	long delta;
> >
> >  	if (!c->parent)
> >  		return;
> >
> > -	min = READ_ONCE(c->min);
> > -	if (min || atomic_long_read(&c->min_usage)) {
> > -		protected = min(usage, min);
> > +	protected = min(usage, READ_ONCE(c->min));
> > +	old_protected = atomic_long_read(&c->min_usage);
> > +	if (protected != old_protected) {
>
> I have to cache that code back into my brain. It is a really subtle
> thing and it is not obvious why this is still correct. I will think
> about it some more, but the changelog could help with that a lot.

OK, so this patch will be most useful when min > 0 && min < usage,
because then the protection doesn't really change between calls. In
other words, the usage grows above the protection, and your workload
benefits from this change because that happens a lot, as only a part of
the workload is protected. Correct?

Unless I have missed something, this shouldn't break correctness, but I
still have to think about the proportional distribution of the
protection because that adds complexity here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 10:18 ` Michal Hocko
@ 2022-08-22 14:55 ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 14:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> [...]
> > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > index eb156ff5d603..47711aa28161 100644
> > > --- a/mm/page_counter.c
> > > +++ b/mm/page_counter.c
> > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > >  				      unsigned long usage)
> > >  {
> > >  	unsigned long protected, old_protected;
> > > -	unsigned long low, min;
> > >  	long delta;
> > >
> > >  	if (!c->parent)
> > >  		return;
> > >
> > > -	min = READ_ONCE(c->min);
> > > -	if (min || atomic_long_read(&c->min_usage)) {
> > > -		protected = min(usage, min);
> > > +	protected = min(usage, READ_ONCE(c->min));
> > > +	old_protected = atomic_long_read(&c->min_usage);
> > > +	if (protected != old_protected) {
> >
> > I have to cache that code back into my brain. It is a really subtle
> > thing and it is not obvious why this is still correct. I will think
> > about it some more, but the changelog could help with that a lot.
>
> OK, so this patch will be most useful when min > 0 && min < usage,
> because then the protection doesn't really change between calls. In
> other words, the usage grows above the protection, and your workload
> benefits from this change because that happens a lot, as only a part
> of the workload is protected. Correct?

Yes, that is correct. I hope the experiment setup is clear now.

> Unless I have missed something, this shouldn't break correctness, but
> I still have to think about the proportional distribution of the
> protection because that adds complexity here.

The patch is not changing any semantics. It is just removing an
unnecessary atomic xchg() for a specific scenario (min > 0 && min <
usage). I don't think there will be any change related to the
proportional distribution of the protection.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 14:55 ` Shakeel Butt
@ 2022-08-22 15:20 ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 15:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > [...]
> > > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > > index eb156ff5d603..47711aa28161 100644
> > > > --- a/mm/page_counter.c
> > > > +++ b/mm/page_counter.c
> > > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > >  				      unsigned long usage)
> > > >  {
> > > >  	unsigned long protected, old_protected;
> > > > -	unsigned long low, min;
> > > >  	long delta;
> > > >
> > > >  	if (!c->parent)
> > > >  		return;
> > > >
> > > > -	min = READ_ONCE(c->min);
> > > > -	if (min || atomic_long_read(&c->min_usage)) {
> > > > -		protected = min(usage, min);
> > > > +	protected = min(usage, READ_ONCE(c->min));
> > > > +	old_protected = atomic_long_read(&c->min_usage);
> > > > +	if (protected != old_protected) {
> > >
> > > I have to cache that code back into my brain. It is a really subtle
> > > thing and it is not obvious why this is still correct. I will think
> > > about it some more, but the changelog could help with that a lot.
> >
> > OK, so this patch will be most useful when min > 0 && min < usage,
> > because then the protection doesn't really change between calls. In
> > other words, the usage grows above the protection, and your workload
> > benefits from this change because that happens a lot, as only a part
> > of the workload is protected. Correct?
>
> Yes, that is correct. I hope the experiment setup is clear now.

Maybe it is just me who took a while to grasp it, but we may want to
save our future selves from going through that mental process again. So
please just be explicit about that in the changelog. It is really the
point that workloads exceeding the protection benefit the most that
helps in understanding this patch.

> > Unless I have missed something, this shouldn't break correctness, but
> > I still have to think about the proportional distribution of the
> > protection because that adds complexity here.
>
> The patch is not changing any semantics. It is just removing an
> unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> usage). I don't think there will be any change related to the
> proportional distribution of the protection.

Yes, I suspect you are right. I just remembered previous fixes like
503970e42325 ("mm: memcontrol: fix memory.low proportional
distribution") which made me nervous that this is a tricky area.

I will have another look tomorrow with a fresh brain and send an ack.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 15:20 ` Michal Hocko
  (?)
@ 2022-08-22 16:06 ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 16:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 8:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > > [...]
> > > OK, so this patch will be most useful when min > 0 && min < usage,
> > > because then the protection doesn't really change between calls. In
> > > other words, the usage grows above the protection, and your
> > > workload benefits from this change because that happens a lot, as
> > > only a part of the workload is protected. Correct?
> >
> > Yes, that is correct. I hope the experiment setup is clear now.
>
> Maybe it is just me who took a while to grasp it, but we may want to
> save our future selves from going through that mental process again.
> So please just be explicit about that in the changelog. It is really
> the point that workloads exceeding the protection benefit the most
> that helps in understanding this patch.

I will add more detail in the commit message in the next version.

> > > Unless I have missed something, this shouldn't break correctness,
> > > but I still have to think about the proportional distribution of
> > > the protection because that adds complexity here.
> >
> > The patch is not changing any semantics. It is just removing an
> > unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> > usage). I don't think there will be any change related to the
> > proportional distribution of the protection.
>
> Yes, I suspect you are right. I just remembered previous fixes like
> 503970e42325 ("mm: memcontrol: fix memory.low proportional
> distribution") which made me nervous that this is a tricky area.
>
> I will have another look tomorrow with a fresh brain and send an ack.

I will wait for your ack before sending the next version.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 15:20 ` Michal Hocko
  (?)
@ 2022-08-23  9:42 ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-23  9:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon 22-08-22 17:20:02, Michal Hocko wrote:
> On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > Unless I have missed something, this shouldn't break correctness,
> > > but I still have to think about the proportional distribution of
> > > the protection because that adds complexity here.
> >
> > The patch is not changing any semantics. It is just removing an
> > unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> > usage). I don't think there will be any change related to the
> > proportional distribution of the protection.
>
> Yes, I suspect you are right. I just remembered previous fixes like
> 503970e42325 ("mm: memcontrol: fix memory.low proportional
> distribution") which made me nervous that this is a tricky area.
>
> I will have another look tomorrow with a fresh brain and send an ack.

I cannot spot any problem. But I guess it would be good to have a
little comment to explain that races on the min_usage update (mentioned
by Roman) are acceptable and that the savings from avoiding the atomic
update are preferred.

The worst case I can imagine would be something like an uncharge of 4kB
racing with a charge of 2MB. The first reduces the protection
(min_usage) while the other one misses that update and doesn't increase
it. But even then the effect shouldn't be really large. At least I have
a hard time imagining this would throw things off too much.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
From: Roman Gushchin @ 2022-08-22 18:23 UTC
To: Shakeel Butt
Cc: Johannes Weiner, Michal Hocko, Muchun Song, Michal Koutný,
    Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
    Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 12:17:35AM +0000, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> unconditionally. It only needs to do that operation if the new value of
> protection is different from the old one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level cgroup hierarchy with the
> top level having min and low set up appropriately. More specifically,
> memory.min equal to the size of the netperf binary and memory.low
> double that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%.

Nice savings!

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>  				      unsigned long usage)
>  {
>  	unsigned long protected, old_protected;
> -	unsigned long low, min;
>  	long delta;
>
>  	if (!c->parent)
>  		return;
>
> -	min = READ_ONCE(c->min);
> -	if (min || atomic_long_read(&c->min_usage)) {
> -		protected = min(usage, min);
> +	protected = min(usage, READ_ONCE(c->min));
> +	old_protected = atomic_long_read(&c->min_usage);
> +	if (protected != old_protected) {
>  		old_protected = atomic_long_xchg(&c->min_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
>  			atomic_long_add(delta, &c->parent->children_min_usage);

What if there is a concurrent update of c->min_usage? Then the patched
version can miss an update. I can't imagine a case where it would lead
to bad consequences, so it's probably ok. But it's not super obvious.

I think the way to think of it is that a missed update will be fixed by
the next one, so it's ok to run for some time with old numbers.

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!
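To see the new control flow in isolation, here is a user-space sketch of the patched min-protection path. This is an illustration only, not the kernel code: the struct is reduced to the fields this path touches, C11 atomics stand in for `atomic_long_t`, a plain read of `c->min` stands in for `READ_ONCE()`, and only a single parent level is walked (the kernel propagates level by level from the charge path).

```c
#include <stdatomic.h>

/* Reduced model of struct page_counter (min-protection fields only). */
struct page_counter {
	atomic_long min_usage;
	atomic_long children_min_usage;
	long min;			/* READ_ONCE(c->min) in the kernel */
	struct page_counter *parent;
};

static long min_l(long a, long b) { return a < b ? a : b; }

static void propagate_protected_usage(struct page_counter *c, long usage)
{
	long protected, old_protected, delta;

	if (!c->parent)
		return;

	/* New value of the propagated protection. */
	protected = min_l(usage, c->min);

	/*
	 * Patched fast path: skip the atomic exchange when nothing
	 * changed.  A concurrent updater may race between the load and
	 * the exchange; a missed update is corrected by the next call,
	 * as discussed in the review above.
	 */
	old_protected = atomic_load(&c->min_usage);
	if (protected != old_protected) {
		old_protected = atomic_exchange(&c->min_usage, protected);
		delta = protected - old_protected;
		if (delta)
			atomic_fetch_add(&c->parent->children_min_usage, delta);
	}
}
```

With `min = 100`, a charge to usage 150 clamps the propagated protection at 100; calling again with the same usage takes the new early-exit path and performs no atomic read-modify-write at all, which is exactly the saving the patch is after.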
* [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
From: Shakeel Butt @ 2022-08-22  0:17 UTC
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song
Cc: Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang,
    Oliver Sang, Andrew Morton, lkp, cgroups, linux-mm, netdev,
    linux-kernel, Shakeel Butt

With memcg v2 enabled, memcg->memory.usage is a very hot member for
workloads doing memcg charging on multiple CPUs concurrently,
particularly network-intensive workloads. In addition, there is false
cache sharing between memory.usage and memory.high on the charge path.
This patch moves usage into a separate cacheline and moves all the
read-mostly fields into another separate cacheline.

To evaluate the impact of this optimization, on a 72 CPUs machine, we
ran the following workload in a three level cgroup hierarchy with the
top level having min and low set up appropriately. More specifically,
memory.min equal to the size of the netperf binary and memory.low double
that.

$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)	10482.7 Mbps
With patch		12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side-effect of this patch is the increase in the size of struct
mem_cgroup. However, for the performance improvement, this additional
size is worth it. In addition, there are opportunities to reduce the
size of struct mem_cgroup, like deprecation of the kmem and tcpmem page
counters and better packing.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
---
 include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 679591301994..8ce99bde645f 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -3,15 +3,27 @@
 #define _LINUX_PAGE_COUNTER_H

 #include <linux/atomic.h>
+#include <linux/cache.h>
 #include <linux/kernel.h>
 #include <asm/page.h>

+#if defined(CONFIG_SMP)
+struct pc_padding {
+	char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define PC_PADDING(name)	struct pc_padding name
+#else
+#define PC_PADDING(name)
+#endif
+
 struct page_counter {
+	/*
+	 * Make sure 'usage' does not share cacheline with any other field. The
+	 * memcg->memory.usage is a hot member of struct mem_cgroup.
+	 */
+	PC_PADDING(_pad1_);
 	atomic_long_t usage;
-	unsigned long min;
-	unsigned long low;
-	unsigned long high;
-	unsigned long max;
+	PC_PADDING(_pad2_);

 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -23,16 +35,16 @@ struct page_counter {
 	atomic_long_t low_usage;
 	atomic_long_t children_low_usage;

-	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;

-	/*
-	 * 'parent' is placed here to be far from 'usage' to reduce
-	 * cache false sharing, as 'usage' is written mostly while
-	 * parent is frequently read for cgroup's hierarchical
-	 * counting nature.
-	 */
+	/* Keep all the read-mostly fields in a separate cacheline. */
+	PC_PADDING(_pad3_);
+
+	unsigned long min;
+	unsigned long low;
+	unsigned long high;
+	unsigned long max;
 	struct page_counter *parent;
 };

--
2.37.1.595.g718a3a8f04-goog
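The layout trick above can be sketched in plain C11: an over-aligned member starts a fresh cacheline, so the hot counter and the read-mostly limits never share one. This is a user-space illustration, not the kernel struct; `alignas(64)` stands in for `____cacheline_internodealigned_in_smp`, and 64 is an assumed line size (the kernel uses a per-architecture value, potentially larger).

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed L1 line size; stand-in for the kernel's value */

/* Sketch of the rearranged layout: 'usage' alone in its own cacheline,
 * the read-mostly limits grouped together in another one. */
struct counter {
	alignas(CACHELINE) atomic_long usage;	/* hot: written on every charge */

	alignas(CACHELINE) unsigned long min;	/* read-mostly from here on */
	unsigned long low;
	unsigned long high;
	unsigned long max;
};

/* Which cacheline (index) a field starts in. */
#define LINE_OF(type, field) (offsetof(type, field) / CACHELINE)
```

The alignment costs padding bytes (the struct grows from 5 machine words to two full cachelines), which mirrors the size-versus-performance trade-off the commit message acknowledges for struct mem_cgroup.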
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
From: Soheil Hassas Yeganeh @ 2022-08-22  0:24 UTC
To: Shakeel Butt
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
    Michal Koutný, Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton,
    lkp, cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
>
[...]
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		12413.7 Mbps (18.4% improvement)
>
> With the patch, the throughput improved by 18.4%.

Shakeel, for my understanding: is this on top of the gains from the
previous patch?

> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
[...]
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
From: Shakeel Butt @ 2022-08-22  4:55 UTC
To: Soheil Hassas Yeganeh
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
    Michal Koutný, Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton,
    lkp, Cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 5:24 PM Soheil Hassas Yeganeh <soheil@google.com> wrote:
>
> On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
[...]
> > With the patch, the throughput improved by 18.4%.
>
> Shakeel, for my understanding: is this on top of the gains from the
> previous patch?

No, this is independent of the previous patch. The cover letter has the
numbers for all three optimizations applied together.
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  4:55     ` Shakeel Butt
@ 2022-08-22 13:06       ` Soheil Hassas Yeganeh
  -1 siblings, 0 replies; 86+ messages in thread
From: Soheil Hassas Yeganeh @ 2022-08-22 13:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný, Eric Dumazet, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 12:55 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Sun, Aug 21, 2022 at 5:24 PM Soheil Hassas Yeganeh <soheil@google.com> wrote:
> >
> > On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > > the workloads doing memcg charging on multiple CPUs concurrently.
> > > Particularly the network intensive workloads. In addition, there is a
> > > false cache sharing between memory.usage and memory.high on the charge
> > > path. This patch moves the usage into a separate cacheline and move all
> > > the read most fields into separate cacheline.
> > >
> > > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > > ran the following workload in a three level of cgroup hierarchy with top
> > > level having min and low setup appropriately. More specifically
> > > memory.min equal to size of netperf binary and memory.low double of
> > > that.
> > >
> > > $ netserver -6
> > > # 36 instances of netperf with following params
> > > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> > >
> > > Results (average throughput of netperf):
> > > Without (6.0-rc1)	10482.7 Mbps
> > > With patch		12413.7 Mbps (18.4% improvement)
> > >
> > > With the patch, the throughput improved by 18.4%.
> >
> > Shakeel, for my understanding: is this on top of the gains from the
> > previous patch?
> >
>
> No, this is independent of the previous patch. The cover letter has
> the numbers for all three optimizations applied together.

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22  2:10   ` Feng Tang
  -1 siblings, 0 replies; 86+ messages in thread
From: Feng Tang @ 2022-08-22 2:10 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Sang, Oliver,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 08:17:36AM +0800, Shakeel Butt wrote:
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		12413.7 Mbps (18.4% improvement)
>
> With the patch, the throughput improved by 18.4%.
>
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Looks good to me, with one nit below.

Reviewed-by: Feng Tang <feng.tang@intel.com>

> ---
>  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
>  #define _LINUX_PAGE_COUNTER_H
>
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
>
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> +	char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name)	struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif

There are 2 similar padding definitions in mmzone.h and memcontrol.h:

struct memcg_padding {
	char x[0];
} ____cacheline_internodealigned_in_smp;
#define MEMCG_PADDING(name)	struct memcg_padding name

struct zone_padding {
	char x[0];
} ____cacheline_internodealigned_in_smp;
#define ZONE_PADDING(name)	struct zone_padding name;

Maybe we can generalize them, and lift it into include/cache.h? so
that more places can reuse it in future.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  2:10   ` Feng Tang
  (?)
@ 2022-08-22  4:59     ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 4:59 UTC (permalink / raw)
  To: Feng Tang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Sang, Oliver,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 7:12 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Aug 22, 2022 at 08:17:36AM +0800, Shakeel Butt wrote:
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently.
> > Particularly the network intensive workloads. In addition, there is a
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and move all
> > the read most fields into separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
> >
> > $ netserver -6
> > # 36 instances of netperf with following params
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)	10482.7 Mbps
> > With patch		12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
> >
> > One side-effect of this patch is the increase in the size of struct
> > mem_cgroup. However for the performance improvement, this additional
> > size is worth it. In addition there are opportunities to reduce the size
> > of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> > and better packing.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
>
> Looks good to me, with one nit below.
>
> Reviewed-by: Feng Tang <feng.tang@intel.com>

Thanks.

> > ---
> >  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> >  1 file changed, 23 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994..8ce99bde645f 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,15 +3,27 @@
> >  #define _LINUX_PAGE_COUNTER_H
> >
> >  #include <linux/atomic.h>
> > +#include <linux/cache.h>
> >  #include <linux/kernel.h>
> >  #include <asm/page.h>
> >
> > +#if defined(CONFIG_SMP)
> > +struct pc_padding {
> > +	char x[0];
> > +} ____cacheline_internodealigned_in_smp;
> > +#define PC_PADDING(name)	struct pc_padding name
> > +#else
> > +#define PC_PADDING(name)
> > +#endif
>
> There are 2 similar padding definitions in mmzone.h and memcontrol.h:
>
> struct memcg_padding {
>	char x[0];
> } ____cacheline_internodealigned_in_smp;
> #define MEMCG_PADDING(name)	struct memcg_padding name
>
> struct zone_padding {
>	char x[0];
> } ____cacheline_internodealigned_in_smp;
> #define ZONE_PADDING(name)	struct zone_padding name;
>
> Maybe we can generalize them, and lift it into include/cache.h? so
> that more places can reuse it in future.
>

This makes sense but let me do that in a separate patch.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22 10:23   ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 10:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 00:17:36, Shakeel Butt wrote:
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

Again the workload description is not particularly useful. I guess the
only important aspect is the netserver part below and the number of CPUs
because min and low setup doesn't have much to do with this, right? At
least that is my reading of the memory.high mentioned above.

> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		12413.7 Mbps (18.4% improvement)
>
> With the patch, the throughput improved by 18.4%.
>
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> ---
>  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
>  #define _LINUX_PAGE_COUNTER_H
>
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
>
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> +	char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name)	struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif
> +
>  struct page_counter {
> +	/*
> +	 * Make sure 'usage' does not share cacheline with any other field. The
> +	 * memcg->memory.usage is a hot member of struct mem_cgroup.
> +	 */
> +	PC_PADDING(_pad1_);

Why don't you simply require alignment for the structure?

Other than that, looks good to me and it makes sense.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields @ 2022-08-22 10:23 ` Michal Hocko 0 siblings, 0 replies; 86+ messages in thread From: Michal Hocko @ 2022-08-22 10:23 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp-hn68Rpc1hR1g9hUCZPvPmw, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon 22-08-22 00:17:36, Shakeel Butt wrote: > With memcg v2 enabled, memcg->memory.usage is a very hot member for > the workloads doing memcg charging on multiple CPUs concurrently. > Particularly the network intensive workloads. In addition, there is a > false cache sharing between memory.usage and memory.high on the charge > path. This patch moves the usage into a separate cacheline and move all > the read most fields into separate cacheline. > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > ran the following workload in a three level of cgroup hierarchy with top > level having min and low setup appropriately. More specifically > memory.min equal to size of netperf binary and memory.low double of > that. Again the workload description is not particularly useful. I guess the only important aspect is the netserver part below and the number of CPUs because min and low setup doesn't have much to do with this, right? At least that is my reading of the memory.high mentioned above. > $ netserver -6 > # 36 instances of netperf with following params > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > Results (average throughput of netperf): > Without (6.0-rc1) 10482.7 Mbps > With patch 12413.7 Mbps (18.4% improvement) > > With the patch, the throughput improved by 18.4%. > > One side-effect of this patch is the increase in the size of struct > mem_cgroup. 
However for the performance improvement, this additional > size is worth it. In addition there are opportunities to reduce the size > of struct mem_cgroup like deprecation of kmem and tcpmem page counters > and better packing. > > Signed-off-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Reported-by: kernel test robot <oliver.sang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > --- > include/linux/page_counter.h | 34 +++++++++++++++++++++++----------- > 1 file changed, 23 insertions(+), 11 deletions(-) > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > index 679591301994..8ce99bde645f 100644 > --- a/include/linux/page_counter.h > +++ b/include/linux/page_counter.h > @@ -3,15 +3,27 @@ > #define _LINUX_PAGE_COUNTER_H > > #include <linux/atomic.h> > +#include <linux/cache.h> > #include <linux/kernel.h> > #include <asm/page.h> > > +#if defined(CONFIG_SMP) > +struct pc_padding { > + char x[0]; > +} ____cacheline_internodealigned_in_smp; > +#define PC_PADDING(name) struct pc_padding name > +#else > +#define PC_PADDING(name) > +#endif > + > struct page_counter { > + /* > + * Make sure 'usage' does not share cacheline with any other field. The > + * memcg->memory.usage is a hot member of struct mem_cgroup. > + */ > + PC_PADDING(_pad1_); Why don't you simply require alignment for the structure? Other than that, looks good to me and it makes sense. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields 2022-08-22 10:23 ` Michal Hocko @ 2022-08-22 15:06 ` Shakeel Butt -1 siblings, 0 replies; 86+ messages in thread From: Shakeel Butt @ 2022-08-22 15:06 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML On Mon, Aug 22, 2022 at 3:23 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 22-08-22 00:17:36, Shakeel Butt wrote: > > With memcg v2 enabled, memcg->memory.usage is a very hot member for > > the workloads doing memcg charging on multiple CPUs concurrently. > > Particularly the network intensive workloads. In addition, there is a > > false cache sharing between memory.usage and memory.high on the charge > > path. This patch moves the usage into a separate cacheline and move all > > the read most fields into separate cacheline. > > > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > > ran the following workload in a three level of cgroup hierarchy with top > > level having min and low setup appropriately. More specifically > > memory.min equal to size of netperf binary and memory.low double of > > that. > > Again the workload description is not particularly useful. I guess the > only important aspect is the netserver part below and the number of CPUs > because min and low setup doesn't have much to do with this, right? At > least that is my reading of the memory.high mentioned above. > The experiment numbers below are for only this patch independently i.e. the unnecessary min/low atomic xchg() is still happening for both setups. I could run the experiment without setting min and low but I wanted to keep the setup exactly the same for all three optimizations. 
This patch and the following perf numbers show only the impact of removing false sharing in struct page_counter for memcg->memory on the charging code path. > > $ netserver -6 > > # 36 instances of netperf with following params > > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > > > Results (average throughput of netperf): > > Without (6.0-rc1) 10482.7 Mbps > > With patch 12413.7 Mbps (18.4% improvement) > > > > With the patch, the throughput improved by 18.4%. > > > > One side-effect of this patch is the increase in the size of struct > > mem_cgroup. However for the performance improvement, this additional > > size is worth it. In addition there are opportunities to reduce the size > > of struct mem_cgroup like deprecation of kmem and tcpmem page counters > > and better packing. > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Reported-by: kernel test robot <oliver.sang@intel.com> > > --- > > include/linux/page_counter.h | 34 +++++++++++++++++++++++----------- > > 1 file changed, 23 insertions(+), 11 deletions(-) > > > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > > index 679591301994..8ce99bde645f 100644 > > --- a/include/linux/page_counter.h > > +++ b/include/linux/page_counter.h > > @@ -3,15 +3,27 @@ > > #define _LINUX_PAGE_COUNTER_H > > > > #include <linux/atomic.h> > > +#include <linux/cache.h> > > #include <linux/kernel.h> > > #include <asm/page.h> > > > > +#if defined(CONFIG_SMP) > > +struct pc_padding { > > + char x[0]; > > +} ____cacheline_internodealigned_in_smp; > > +#define PC_PADDING(name) struct pc_padding name > > +#else > > +#define PC_PADDING(name) > > +#endif > > + > > struct page_counter { > > + /* > > + * Make sure 'usage' does not share cacheline with any other field. The > > + * memcg->memory.usage is a hot member of struct mem_cgroup. > > + */ > > + PC_PADDING(_pad1_); > > Why don't you simply require alignment for the structure? I don't just want the alignment of the structure. 
I want different fields of this structure to not share the cache line. More specifically the 'high' and 'usage' fields. With this change the usage will be its own cache line, the read-most fields will be on separate cache line and the fields which sometimes get updated on charge path based on some condition will be a different cache line from the previous two. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields 2022-08-22 15:06 ` Shakeel Butt (?) @ 2022-08-22 15:15 ` Michal Hocko -1 siblings, 0 replies; 86+ messages in thread From: Michal Hocko @ 2022-08-22 15:15 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML On Mon 22-08-22 08:06:14, Shakeel Butt wrote: [...] > > > struct page_counter { > > > + /* > > > + * Make sure 'usage' does not share cacheline with any other field. The > > > + * memcg->memory.usage is a hot member of struct mem_cgroup. > > > + */ > > > + PC_PADDING(_pad1_); > > > > Why don't you simply require alignment for the structure? > > I don't just want the alignment of the structure. I want different > fields of this structure to not share the cache line. More > specifically the 'high' and 'usage' fields. With this change the usage > will be its own cache line, the read-most fields will be on separate > cache line and the fields which sometimes get updated on charge path > based on some condition will be a different cache line from the > previous two. I do not follow. If you make an explicit requirement for the structure alignment then the first field in the structure will be guaranteed to have that alignment, and you can push the rest onto the other cache line by adding padding after it. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields 2022-08-22 15:15 ` Michal Hocko @ 2022-08-22 16:04 ` Shakeel Butt -1 siblings, 0 replies; 86+ messages in thread From: Shakeel Butt @ 2022-08-22 16:04 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML On Mon, Aug 22, 2022 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 22-08-22 08:06:14, Shakeel Butt wrote: > [...] > > > > struct page_counter { > > > > + /* > > > > + * Make sure 'usage' does not share cacheline with any other field. The > > > > + * memcg->memory.usage is a hot member of struct mem_cgroup. > > > > + */ > > > > + PC_PADDING(_pad1_); > > > > > > Why don't you simply require alignment for the structure? > > > > I don't just want the alignment of the structure. I want different > > fields of this structure to not share the cache line. More > > specifically the 'high' and 'usage' fields. With this change the usage > > will be its own cache line, the read-most fields will be on separate > > cache line and the fields which sometimes get updated on charge path > > based on some condition will be a different cache line from the > > previous two. > > I do not follow. If you make an explicit requirement for the structure > alignement then the first field in the structure will be guarantied to > have that alignement and you achieve the rest to be in the other cache > line by adding padding behind that. Oh, you were talking explicitly about _pad1_, yes, we can remove it and make the struct cacheline-aligned. I will do it in the next version. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields 2022-08-22 16:04 ` Shakeel Butt (?) @ 2022-08-22 18:27 ` Roman Gushchin -1 siblings, 0 replies; 86+ messages in thread From: Roman Gushchin @ 2022-08-22 18:27 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Hocko, Johannes Weiner, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML On Mon, Aug 22, 2022 at 09:04:59AM -0700, Shakeel Butt wrote: > On Mon, Aug 22, 2022 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 22-08-22 08:06:14, Shakeel Butt wrote: > > [...] > > > > > struct page_counter { > > > > > + /* > > > > > + * Make sure 'usage' does not share cacheline with any other field. The > > > > > + * memcg->memory.usage is a hot member of struct mem_cgroup. > > > > > + */ > > > > > + PC_PADDING(_pad1_); > > > > > > > > Why don't you simply require alignment for the structure? > > > > > > I don't just want the alignment of the structure. I want different > > > fields of this structure to not share the cache line. More > > > specifically the 'high' and 'usage' fields. With this change the usage > > > will be its own cache line, the read-most fields will be on separate > > > cache line and the fields which sometimes get updated on charge path > > > based on some condition will be a different cache line from the > > > previous two. > > > > I do not follow. If you make an explicit requirement for the structure > > alignement then the first field in the structure will be guarantied to > > have that alignement and you achieve the rest to be in the other cache > > line by adding padding behind that. > > Oh, you were talking explicitly about _pad1_, yes, we can remove it > and make the struct cache align. I will do it in the next version. Yes, please, it caught my eyes too. 
With this change: Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Also, can you, please, include the numbers on the additional memory overhead? I think it is still worth it; we just need to include them for the record. Thanks! ^ permalink raw reply [flat|nested] 86+ messages in thread
* [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 0:17 ` Shakeel Butt @ 2022-08-22 0:17 ` Shakeel Butt -1 siblings, 0 replies; 86+ messages in thread From: Shakeel Butt @ 2022-08-22 0:17 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song Cc: Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel, Shakeel Butt For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger machines and the network intensive workloads requiring throughput in Gbps, 32 is too small and makes the memcg charging path a bottleneck. For now, increase it to 64 for easy acceptance to 6.0. We will need to revisit this in the future for the ever-increasing demand for higher performance. Please note that the memcg charge path drains the per-cpu memcg charge stock, so there should not be any oom behavior change. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy with top level having min and low setup appropriately. More specifically memory.min equal to size of netperf binary and memory.low double of that. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): Without (6.0-rc1) 10482.7 Mbps With patch 17064.7 Mbps (62.7% improvement) With the patch, the throughput improved by 62.7%. Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> --- include/linux/memcontrol.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 4d31ce55b1c0..70ae91188e16 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -354,10 +354,11 @@ struct mem_cgroup { }; /* - * size of first charge trial. "32" comes from vmscan.c's magic value. 
- * TODO: maybe necessary to use big numbers in big irons. + * size of first charge trial. + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the + * workload. */ -#define MEMCG_CHARGE_BATCH 32U +#define MEMCG_CHARGE_BATCH 64U extern struct mem_cgroup *root_mem_cgroup; -- 2.37.1.595.g718a3a8f04-goog ^ permalink raw reply related [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 0:17 ` Shakeel Butt (?) @ 2022-08-22 0:24 ` Soheil Hassas Yeganeh -1 siblings, 0 replies; 86+ messages in thread From: Soheil Hassas Yeganeh @ 2022-08-22 0:24 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote: > > For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger > machines and the network intensive workloads requiring througput in > Gbps, 32 is too small and makes the memcg charging path a bottleneck. > For now, increase it to 64 for easy acceptance to 6.0. We will need to > revisit this in future for ever increasing demand of higher performance. > > Please note that the memcg charge path drain the per-cpu memcg charge > stock, so there should not be any oom behavior change. > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > ran the following workload in a three level of cgroup hierarchy with top > level having min and low setup appropriately. More specifically > memory.min equal to size of netperf binary and memory.low double of > that. > > $ netserver -6 > # 36 instances of netperf with following params > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > Results (average throughput of netperf): > Without (6.0-rc1) 10482.7 Mbps > With patch 17064.7 Mbps (62.7% improvement) > > With the patch, the throughput improved by 62.7%. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > Reported-by: kernel test robot <oliver.sang@intel.com> Nice! 
Acked-by: Soheil Hassas Yeganeh <soheil@google.com> > --- > include/linux/memcontrol.h | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 4d31ce55b1c0..70ae91188e16 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -354,10 +354,11 @@ struct mem_cgroup { > }; > > /* > - * size of first charge trial. "32" comes from vmscan.c's magic value. > - * TODO: maybe necessary to use big numbers in big irons. > + * size of first charge trial. > + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the > + * workload. > */ > -#define MEMCG_CHARGE_BATCH 32U > +#define MEMCG_CHARGE_BATCH 64U > > extern struct mem_cgroup *root_mem_cgroup; > > -- > 2.37.1.595.g718a3a8f04-goog > ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 0:17 ` Shakeel Butt @ 2022-08-22 2:30 ` Feng Tang -1 siblings, 0 replies; 86+ messages in thread From: Feng Tang @ 2022-08-22 2:30 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Sang, Oliver, Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel On Mon, Aug 22, 2022 at 08:17:37AM +0800, Shakeel Butt wrote: > For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger > machines and the network intensive workloads requiring througput in > Gbps, 32 is too small and makes the memcg charging path a bottleneck. > For now, increase it to 64 for easy acceptance to 6.0. We will need to > revisit this in future for ever increasing demand of higher performance. > > Please note that the memcg charge path drain the per-cpu memcg charge > stock, so there should not be any oom behavior change. > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > ran the following workload in a three level of cgroup hierarchy with top > level having min and low setup appropriately. More specifically > memory.min equal to size of netperf binary and memory.low double of > that. > > $ netserver -6 > # 36 instances of netperf with following params > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > Results (average throughput of netperf): > Without (6.0-rc1) 10482.7 Mbps > With patch 17064.7 Mbps (62.7% improvement) > > With the patch, the throughput improved by 62.7%. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > Reported-by: kernel test robot <oliver.sang@intel.com> This batch number has long been a pain point :) thanks for the work! 
Reviewed-by: Feng Tang <feng.tang@intel.com> - Feng > --- > include/linux/memcontrol.h | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 4d31ce55b1c0..70ae91188e16 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -354,10 +354,11 @@ struct mem_cgroup { > }; > > /* > - * size of first charge trial. "32" comes from vmscan.c's magic value. > - * TODO: maybe necessary to use big numbers in big irons. > + * size of first charge trial. > + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the > + * workload. > */ > -#define MEMCG_CHARGE_BATCH 32U > +#define MEMCG_CHARGE_BATCH 64U > > extern struct mem_cgroup *root_mem_cgroup; > > -- > 2.37.1.595.g718a3a8f04-goog > ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 0:17 ` Shakeel Butt (?) @ 2022-08-22 10:47 ` Michal Hocko -1 siblings, 0 replies; 86+ messages in thread From: Michal Hocko @ 2022-08-22 10:47 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel On Mon 22-08-22 00:17:37, Shakeel Butt wrote: > For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger > machines and the network intensive workloads requiring througput in > Gbps, 32 is too small and makes the memcg charging path a bottleneck. > For now, increase it to 64 for easy acceptance to 6.0. We will need to > revisit this in future for ever increasing demand of higher performance. Yes, the batch size has always been an arbitrary number. I do not think there have ever been any solid grounds for the value we have now except we need something and SWAP_CLUSTER_MAX was a good enough template. Increasing it to 64 sounds like a reasonable step. It would be great to have it scale based on the number of CPUs and potentially other factors but that would be hard to get right and actually hard to evaluate because it will depend on the specific workload. > Please note that the memcg charge path drain the per-cpu memcg charge > stock, so there should not be any oom behavior change. It will have an effect on other stuff as well like high limit reclaim backoff and stats flushing. > To evaluate the impact of this optimization, on a 72 CPUs machine, we > ran the following workload in a three level of cgroup hierarchy with top > level having min and low setup appropriately. More specifically > memory.min equal to size of netperf binary and memory.low double of > that. a similar feedback to the test case description as with other patches. 
> > $ netserver -6 > # 36 instances of netperf with following params > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > Results (average throughput of netperf): > Without (6.0-rc1) 10482.7 Mbps > With patch 17064.7 Mbps (62.7% improvement) > > With the patch, the throughput improved by 62.7%. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > Reported-by: kernel test robot <oliver.sang@intel.com> Anyway Acked-by: Michal Hocko <mhocko@suse.com> Thanks! > --- > include/linux/memcontrol.h | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 4d31ce55b1c0..70ae91188e16 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -354,10 +354,11 @@ struct mem_cgroup { > }; > > /* > - * size of first charge trial. "32" comes from vmscan.c's magic value. > - * TODO: maybe necessary to use big numbers in big irons. > + * size of first charge trial. > + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the > + * workload. > */ > -#define MEMCG_CHARGE_BATCH 32U > +#define MEMCG_CHARGE_BATCH 64U > > extern struct mem_cgroup *root_mem_cgroup; > > -- > 2.37.1.595.g718a3a8f04-goog -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 10:47 ` Michal Hocko @ 2022-08-22 15:09 ` Shakeel Butt -1 siblings, 0 replies; 86+ messages in thread From: Shakeel Butt @ 2022-08-22 15:09 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > > ran the following workload in a three level of cgroup hierarchy with top > > level having min and low setup appropriately. More specifically > > memory.min equal to size of netperf binary and memory.low double of > > that. > > a similar feedback to the test case description as with other patches. What more info should I add to the description? Why did I set up min and low or something else? > > > > $ netserver -6 > > # 36 instances of netperf with following params > > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > > > Results (average throughput of netperf): > > Without (6.0-rc1) 10482.7 Mbps > > With patch 17064.7 Mbps (62.7% improvement) > > > > With the patch, the throughput improved by 62.7%. > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Reported-by: kernel test robot <oliver.sang@intel.com> > > Anyway > Acked-by: Michal Hocko <mhocko@suse.com> Thanks ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 15:09 ` Shakeel Butt @ 2022-08-22 15:22 ` Michal Hocko -1 siblings, 0 replies; 86+ messages in thread From: Michal Hocko @ 2022-08-22 15:22 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML On Mon 22-08-22 08:09:01, Shakeel Butt wrote: > On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote: > > > [...] > > > > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > > > ran the following workload in a three level of cgroup hierarchy with top > > > level having min and low setup appropriately. More specifically > > > memory.min equal to size of netperf binary and memory.low double of > > > that. > > > > a similar feedback to the test case description as with other patches. > > What more info should I add to the description? Why did I set up min > and low or something else? I do see why you wanted to keep the test consistent over those three patches. I would just drop the reference to the protection configuration because it likely doesn't make much of an impact, does it? It is the multi cpu setup and false sharing that makes the real difference. Or am I wrong in assuming that? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 15:22 ` Michal Hocko (?) @ 2022-08-22 16:07 ` Shakeel Butt -1 siblings, 0 replies; 86+ messages in thread From: Shakeel Butt @ 2022-08-22 16:07 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML On Mon, Aug 22, 2022 at 8:22 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 22-08-22 08:09:01, Shakeel Butt wrote: > > On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > [...] > > > > > > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > > > > ran the following workload in a three level of cgroup hierarchy with top > > > > level having min and low setup appropriately. More specifically > > > > memory.min equal to size of netperf binary and memory.low double of > > > > that. > > > > > > a similar feedback to the test case description as with other patches. > > > > What more info should I add to the description? Why did I set up min > > and low or something else? > > I do see why you wanted to keep the test consistent over those three > patches. I would just drop the reference to the protection configuration > because it likely doesn't make much of an impact, does it? It is the > multi cpu setup and false sharing that makes the real difference. Or am > I wrong in assuming that? > No, you are correct. I will cleanup the commit message in the next version. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 2022-08-22 0:17 ` Shakeel Butt (?) @ 2022-08-22 18:37 ` Roman Gushchin -1 siblings, 0 replies; 86+ messages in thread From: Roman Gushchin @ 2022-08-22 18:37 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Michal Hocko, Muchun Song, Michal Koutný, Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang, Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel On Mon, Aug 22, 2022 at 12:17:37AM +0000, Shakeel Butt wrote: > For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger > machines and the network intensive workloads requiring througput in > Gbps, 32 is too small and makes the memcg charging path a bottleneck. > For now, increase it to 64 for easy acceptance to 6.0. We will need to > revisit this in future for ever increasing demand of higher performance. > > Please note that the memcg charge path drain the per-cpu memcg charge > stock, so there should not be any oom behavior change. > > To evaluate the impact of this optimization, on a 72 CPUs machine, we > ran the following workload in a three level of cgroup hierarchy with top > level having min and low setup appropriately. More specifically > memory.min equal to size of netperf binary and memory.low double of > that. > > $ netserver -6 > # 36 instances of netperf with following params > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > Results (average throughput of netperf): > Without (6.0-rc1) 10482.7 Mbps > With patch 17064.7 Mbps (62.7% improvement) > > With the patch, the throughput improved by 62.7%. This is pretty significant! Acked-by: Roman Gushchin <roman.gushchin@linux.dev> I wonder only if we want to make it configurable (Idk a sysctl or maybe a config option) and close the topic. Thanks! ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22 18:37 ` Roman Gushchin
@ 2022-08-22 19:34 ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 19:34 UTC (permalink / raw)
To: Roman Gushchin
Cc: Shakeel Butt, Johannes Weiner, Muchun Song, Michal Koutný,
    Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
    Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
[...]
> I wonder only if we want to make it configurable (Idk a sysctl or maybe
> a config option) and close the topic.

I do not think this is a good idea. We have other examples where we have
outsourced internal tuning to the userspace and it has mostly proven
impractical and long term more problematic than useful (e.g.
lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
name some that come to my mind). I have seen these used incorrectly more
often than usefully.

In this case, I guess we should consider either moving to per-memcg
charge batching and see whether the pcp overhead x memcg_count is worth
that, or some automagic tuning of the batch size depending on how
effectively the batch is used. Certainly a lot of room for
experimenting.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22 19:34 ` Michal Hocko
@ 2022-08-23  2:22 ` Roman Gushchin
  -1 siblings, 0 replies; 86+ messages in thread
From: Roman Gushchin @ 2022-08-23 2:22 UTC (permalink / raw)
To: Michal Hocko
Cc: Shakeel Butt, Johannes Weiner, Muchun Song, Michal Koutný,
    Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
    Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> [...]
> > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > a config option) and close the topic.
>
> I do not think this is a good idea. We have other examples where we have
> outsourced internal tuning to the userspace and it has mostly proven
> impractical and long term more problematic than useful (e.g.
> lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
> name some that come to my mind). I have seen these used incorrectly more
> often than usefully.

I agree, not a strong opinion here. But I wonder if somebody will
complain about Shakeel's change because of the reduced accuracy.
I know some users are using memory cgroups to track the size of various
workloads (including relatively small ones) and the 32->64 pages per cpu
change can be noticeable for them. But we can wait for an actual bug
report :)

> In this case, I guess we should consider either moving to per-memcg
> charge batching and see whether the pcp overhead x memcg_count is worth
> that, or some automagic tuning of the batch size depending on how
> effectively the batch is used. Certainly a lot of room for
> experimenting.

I'm not a big believer in automagic tuning here because it's a
fundamental trade-off of accuracy vs performance, and various users
might make a different choice depending on their needs, not on the cpu
count or something else.

Per-memcg batching sounds interesting though. For example, we can likely
batch updates on leaf cgroups and have a single atomic update instead of
multiple most of the time. Or do you mean something different?

Thanks!

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-23  2:22 ` Roman Gushchin
@ 2022-08-23  4:49 ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-23 4:49 UTC (permalink / raw)
To: Roman Gushchin
Cc: Shakeel Butt, Johannes Weiner, Muchun Song, Michal Koutný,
    Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
    Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 19:22:26, Roman Gushchin wrote:
> On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> > On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> > [...]
> > > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > > a config option) and close the topic.
> >
> > I do not think this is a good idea. We have other examples where we have
> > outsourced internal tuning to the userspace and it has mostly proven
> > impractical and long term more problematic than useful (e.g.
> > lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
> > name some that come to my mind). I have seen these used incorrectly more
> > often than usefully.
>
> I agree, not a strong opinion here. But I wonder if somebody will
> complain about Shakeel's change because of the reduced accuracy.
> I know some users are using memory cgroups to track the size of various
> workloads (including relatively small ones) and the 32->64 pages per cpu
> change can be noticeable for them. But we can wait for an actual bug
> report :)

Yes, that would be my approach. I have seen reports like that already
but that was mostly because of heavy caching on the SLUB side on older
kernels. So there surely are workloads with small limits configured
(e.g. 20MB). On the other hand, those users were receptive to adapting
their limits as they were kinda arbitrary anyway.

> > In this case, I guess we should consider either moving to per-memcg
> > charge batching and see whether the pcp overhead x memcg_count is worth
> > that, or some automagic tuning of the batch size depending on how
> > effectively the batch is used. Certainly a lot of room for
> > experimenting.
>
> I'm not a big believer in automagic tuning here because it's a
> fundamental trade-off of accuracy vs performance, and various users
> might make a different choice depending on their needs, not on the cpu
> count or something else.

Yes, this is not an easy thing to get right. I was mostly thinking of
some auto scaling based on the limit size, or growing the stock if cache
hits are common and decreasing it when stocks get flushed often because
multiple memcgs compete over the same pcp stock. But to me it seems like
a per-memcg approach might lead to better results without too many
heuristics (albeit more memory hungry).

> Per-memcg batching sounds interesting though. For example, we can likely
> batch updates on leaf cgroups and have a single atomic update instead of
> multiple most of the time. Or do you mean something different?

No, that was exactly my thinking as well.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread