* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  [not found] ` <CADvbK_csvmkKe46hT9792=+Qcjor2EvkkAnr--CJK3NGX-N9BQ@mail.gmail.com>
@ 2022-06-23 22:50 ` Xin Long
  2022-06-24  1:57   ` Jakub Kicinski
  0 siblings, 1 reply; 32+ messages in thread

From: Xin Long @ 2022-06-23 22:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Marcelo Ricardo Leitner, kernel test robot, Eric Dumazet,
      Shakeel Butt, Soheil Hassas Yeganeh, LKML,
      Linux Memory Management List, network dev, linux-s390, mptcp,
      linux-sctp @ vger . kernel . org, lkp, kbuild test robot,
      Huang Ying, feng.tang, zhengjun.xing, fengwei.yin, Ying Xu

On Wed, Jun 22, 2022 at 11:08 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> Yes, I'm working on it. I couldn't see the regression in my env with
> the 'reproduce' script attached.
> I will try with lkp tomorrow.
>
> Thanks.
>
> On Wed, Jun 22, 2022 at 8:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > Could someone working on SCTP double check this is a real regression?
> > Feels like the regression reports are flowing at such rate its hard
> > to keep up.
> >
> > > commit:
> > >   7c80b038d2 ("net: fix sk_wmem_schedule() and sk_rmem_schedule() errors")
> > >   4890b686f4 ("net: keep sk->sk_forward_alloc as small as possible")
> > >
> > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de
> > > ---------------- ---------------------------
> > >          %stddev     %change         %stddev
> > >              \          |                \
> > >      15855           -69.4%       4854        netperf.Throughput_Mbps
> > >     570788           -69.4%     174773        netperf.Throughput_total_Mbps
...
> > >       0.00            +5.1        5.10 ±  5%  perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put
> > >       0.17 ±141%      +5.3        5.42 ±  6%  perf-profile.calltrace.cycles-pp.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter
> > >       0.00            +5.3        5.35 ±  6%  perf-profile.calltrace.cycles-pp.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack
> > >       0.00            +5.5        5.51 ±  6%  perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
> > >       0.00            +5.7        5.65 ±  6%  perf-profile.calltrace.cycles-pp.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.____sys_recvmsg
...
> > >       0.00            +4.0        4.04 ±  6%  perf-profile.children.cycles-pp.mem_cgroup_charge_skmem
> > >       2.92 ±  6%      +4.2        7.16 ±  6%  perf-profile.children.cycles-pp.sctp_outq_sack
> > >       0.00            +4.3        4.29 ±  6%  perf-profile.children.cycles-pp.__sk_mem_raise_allocated
> > >       0.00            +4.3        4.32 ±  6%  perf-profile.children.cycles-pp.__sk_mem_schedule
> > >       1.99 ±  6%      +4.4        6.40 ±  6%  perf-profile.children.cycles-pp.consume_skb
> > >       1.78 ±  6%      +4.6        6.42 ±  6%  perf-profile.children.cycles-pp.kfree_skb_reason
> > >       0.37 ±  8%      +5.0        5.40 ±  6%  perf-profile.children.cycles-pp.sctp_wfree
> > >       0.87 ±  9%     +10.3       11.20 ±  6%  perf-profile.children.cycles-pp.skb_release_head_state
> > >       0.00           +10.7       10.66 ±  6%  perf-profile.children.cycles-pp.__sk_mem_reduce_allocated
...
> > >       0.00            +1.2        1.19 ±  7%  perf-profile.self.cycles-pp.try_charge_memcg
> > >       0.00            +2.0        1.96 ±  6%  perf-profile.self.cycles-pp.page_counter_uncharge
> > >       0.00            +2.1        2.07 ±  5%  perf-profile.self.cycles-pp.page_counter_try_charge
> > >       1.09 ±  8%      +2.8        3.92 ±  6%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
> > >       0.29 ±  6%      +3.5        3.81 ±  6%  perf-profile.self.cycles-pp.sctp_eat_data
> > >       0.00            +7.8        7.76 ±  6%  perf-profile.self.cycles-pp.__sk_mem_reduce_allocated

From the perf data, we can see that __sk_mem_reduce_allocated() is the
function whose CPU usage grew the most, and the mem_cgroup APIs are
called from it. That means mem cgroup must be enabled in the test env,
which may explain why I couldn't reproduce the regression.

Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
possible") uses sk_mem_reclaim() (checking reclaimable >= PAGE_SIZE) to
reclaim the memory, which calls __sk_mem_reduce_allocated() *more
frequently* than before (checking reclaimable >= SK_RECLAIM_THRESHOLD).
That might be cheap when mem_cgroup_sockets_enabled is false, but I'm
not sure it is still cheap when mem_cgroup_sockets_enabled is true.

I think SCTP netperf can trigger this because the CPU is the bottleneck
in SCTP netperf testing, which makes it more sensitive to the extra
function calls than TCP.

Can we re-run this testing without mem cgroup enabled?

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread
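The before/after reclaim thresholds described above can be sketched as a toy model. This is an illustration, not the kernel code; the 2 MB value of SK_RECLAIM_THRESHOLD is my reading of the pre-4890b686f4 sources, so treat it as an assumption:

```c
#include <assert.h>

#define PAGE_SIZE            4096
#define SK_RECLAIM_THRESHOLD (1 << 21)	/* 2 MB: assumed pre-4890b686f4 trigger */

/* Would sk_mem_reclaim() end up calling __sk_mem_reduce_allocated()
 * for this much reclaimable forward-alloc memory? */
static int old_would_reclaim(long reclaimable)
{
	return reclaimable >= SK_RECLAIM_THRESHOLD;
}

static int new_would_reclaim(long reclaimable)
{
	return reclaimable >= PAGE_SIZE;
}
```

With ~10 KB messages, the old threshold is almost never crossed on a single release, while the new one is crossed nearly every time, so each release can take the (memcg-visible) uncharge path.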
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-23 22:50 ` [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression Xin Long @ 2022-06-24 1:57 ` Jakub Kicinski 2022-06-24 4:13 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Jakub Kicinski @ 2022-06-24 1:57 UTC (permalink / raw) To: Xin Long Cc: Marcelo Ricardo Leitner, kernel test robot, Eric Dumazet, Shakeel Butt, Soheil Hassas Yeganeh, LKML, Linux Memory Management List, network dev, linux-s390, mptcp, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, feng.tang, zhengjun.xing, fengwei.yin, Ying Xu On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > From the perf data, we can see __sk_mem_reduce_allocated() is the one > using CPU the most more than before, and mem_cgroup APIs are also > called in this function. It means the mem cgroup must be enabled in > the test env, which may explain why I couldn't reproduce it. > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > reclaim the memory, which is *more frequent* to call > __sk_mem_reduce_allocated() than before (checking reclaimable >= > SK_RECLAIM_THRESHOLD). It might be cheap when > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > cheap when mem_cgroup_sockets_enabled is true. > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > for SCTP netperf testing, which is more sensitive to the extra > function calls than TCP. > > Can we re-run this testing without mem cgroup enabled? FWIW I defer to Eric, thanks a lot for double checking the report and digging in! ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 1:57 ` Jakub Kicinski @ 2022-06-24 4:13 ` Eric Dumazet 2022-06-24 4:22 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Eric Dumazet @ 2022-06-24 4:13 UTC (permalink / raw) To: Jakub Kicinski Cc: Xin Long, Marcelo Ricardo Leitner, kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML, Linux Memory Management List, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Tang, Feng, zhengjun.xing, fengwei.yin, Ying Xu On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > using CPU the most more than before, and mem_cgroup APIs are also > > called in this function. It means the mem cgroup must be enabled in > > the test env, which may explain why I couldn't reproduce it. > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > reclaim the memory, which is *more frequent* to call > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > SK_RECLAIM_THRESHOLD). It might be cheap when > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > cheap when mem_cgroup_sockets_enabled is true. > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > for SCTP netperf testing, which is more sensitive to the extra > > function calls than TCP. > > > > Can we re-run this testing without mem cgroup enabled? > > FWIW I defer to Eric, thanks a lot for double checking the report > and digging in! 
I did tests with TCP + memcg and noticed a very small additional cost in memcg functions, because of suboptimal layout: Extract of an internal Google bug, update from June 9th: -------------------------------- I have noticed a minor false sharing to fetch (struct mem_cgroup)->css.parent, at offset 0xc0, because it shares the cache line containing struct mem_cgroup.memory, at offset 0xd0 Ideally, memcg->socket_pressure and memcg->parent should sit in a read mostly cache line. ----------------------- But nothing that could explain a "-69.4% regression" memcg has a very similar strategy of per-cpu reserves, with MEMCG_CHARGE_BATCH being 32 pages per cpu. It is not clear why SCTP with 10K writes would overflow this reserve constantly. Presumably memcg experts will have to rework structure alignments to make sure they can cope better with more charge/uncharge operations, because we are not going back to gigantic per-socket reserves, this simply does not scale. ^ permalink raw reply [flat|nested] 32+ messages in thread
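The per-CPU reserve strategy mentioned here can be pictured with a toy, single-threaded stand-in for the real per-CPU stock (names and structure are illustrative, not the memcontrol.c implementation; charges are assumed to be at most one batch):

```c
#include <assert.h>

#define MEMCG_CHARGE_BATCH 32	/* pages cached per CPU, as in the kernel */

static long global_counter;	/* stand-in for the shared page_counter */
static long percpu_stock;	/* this CPU's pre-charged pages */
static long counter_updates;	/* how often the shared counter was touched */

/* Charge nr_pages, refilling the per-CPU stock one batch at a time so
 * that most charges never touch the contended global counter. */
static void toy_charge(long nr_pages)
{
	if (percpu_stock < nr_pages) {
		global_counter += MEMCG_CHARGE_BATCH;
		percpu_stock += MEMCG_CHARGE_BATCH;
		counter_updates++;
	}
	percpu_stock -= nr_pages;
}
```

In this model, 32 single-page charges cost one shared-counter update, which is why it is surprising that the SCTP workload would drain the reserve constantly.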
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 4:13 ` Eric Dumazet @ 2022-06-24 4:22 ` Eric Dumazet 2022-06-24 5:13 ` Feng Tang 2022-06-24 6:34 ` Shakeel Butt 2 siblings, 0 replies; 32+ messages in thread From: Eric Dumazet @ 2022-06-24 4:22 UTC (permalink / raw) To: Jakub Kicinski Cc: Xin Long, Marcelo Ricardo Leitner, kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML, Linux Memory Management List, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Tang, Feng, zhengjun.xing, fengwei.yin, Ying Xu On Fri, Jun 24, 2022 at 6:13 AM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > > using CPU the most more than before, and mem_cgroup APIs are also > > > called in this function. It means the mem cgroup must be enabled in > > > the test env, which may explain why I couldn't reproduce it. > > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > > reclaim the memory, which is *more frequent* to call > > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > > SK_RECLAIM_THRESHOLD). It might be cheap when > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > > cheap when mem_cgroup_sockets_enabled is true. > > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > > for SCTP netperf testing, which is more sensitive to the extra > > > function calls than TCP. > > > > > > Can we re-run this testing without mem cgroup enabled? > > > > FWIW I defer to Eric, thanks a lot for double checking the report > > and digging in! 
> > I did tests with TCP + memcg and noticed a very small additional cost > in memcg functions, > because of suboptimal layout: > > Extract of an internal Google bug, update from June 9th: > > -------------------------------- > I have noticed a minor false sharing to fetch (struct > mem_cgroup)->css.parent, at offset 0xc0, > because it shares the cache line containing struct mem_cgroup.memory, > at offset 0xd0 > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read > mostly cache line. > ----------------------- > > But nothing that could explain a "-69.4% regression" I guess the test now hits memcg limits more often, forcing expensive reclaim, and the memcg limits need some adjustments. Overall, tests enabling memcg should probably need fine tuning, I will defer to Intel folks. > > memcg has a very similar strategy of per-cpu reserves, with > MEMCG_CHARGE_BATCH being 32 pages per cpu. > > It is not clear why SCTP with 10K writes would overflow this reserve constantly. > > Presumably memcg experts will have to rework structure alignments to > make sure they can cope better > with more charge/uncharge operations, because we are not going back to > gigantic per-socket reserves, > this simply does not scale. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 4:13 ` Eric Dumazet 2022-06-24 4:22 ` Eric Dumazet @ 2022-06-24 5:13 ` Feng Tang 2022-06-24 5:45 ` Eric Dumazet 2022-06-24 6:34 ` Shakeel Butt 2 siblings, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-06-24 5:13 UTC (permalink / raw) To: Eric Dumazet Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML, Linux Memory Management List, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin, Ying Xu Hi Eric, On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote: > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > > using CPU the most more than before, and mem_cgroup APIs are also > > > called in this function. It means the mem cgroup must be enabled in > > > the test env, which may explain why I couldn't reproduce it. > > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > > reclaim the memory, which is *more frequent* to call > > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > > SK_RECLAIM_THRESHOLD). It might be cheap when > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > > cheap when mem_cgroup_sockets_enabled is true. > > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > > for SCTP netperf testing, which is more sensitive to the extra > > > function calls than TCP. > > > > > > Can we re-run this testing without mem cgroup enabled? > > > > FWIW I defer to Eric, thanks a lot for double checking the report > > and digging in! 
>
> I did tests with TCP + memcg and noticed a very small additional cost
> in memcg functions,
> because of suboptimal layout:
>
> Extract of an internal Google bug, update from June 9th:
>
> --------------------------------
> I have noticed a minor false sharing to fetch (struct
> mem_cgroup)->css.parent, at offset 0xc0,
> because it shares the cache line containing struct mem_cgroup.memory,
> at offset 0xd0
>
> Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> mostly cache line.
> -----------------------
>
> But nothing that could explain a "-69.4% regression"

We can double check that.

> memcg has a very similar strategy of per-cpu reserves, with
> MEMCG_CHARGE_BATCH being 32 pages per cpu.

We have proposed a patch to increase the batch number for stats
updates, which was not accepted as it hurts the accuracy, and the data
is used by many tools.

> It is not clear why SCTP with 10K writes would overflow this reserve constantly.
>
> Presumably memcg experts will have to rework structure alignments to
> make sure they can cope better
> with more charge/uncharge operations, because we are not going back to
> gigantic per-socket reserves,
> this simply does not scale.

Yes, the memcg statistics and charge/uncharge updates are very
sensitive to the data alignment layout, and can easily trigger
performance changes, as we've seen in quite a few similar cases over
the past several years.

One pattern we've seen is that even if a memcg stats-updating or charge
function only takes about 2%~3% of the CPU cycles in the perf-profile
data, once it gets affected, the performance change can be amplified to
60% or more.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 5:13 ` Feng Tang @ 2022-06-24 5:45 ` Eric Dumazet 2022-06-24 6:00 ` Feng Tang 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2022-06-24 5:45 UTC (permalink / raw) To: Feng Tang Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML, Linux Memory Management List, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin, Ying Xu On Fri, Jun 24, 2022 at 7:14 AM Feng Tang <feng.tang@intel.com> wrote: > > Hi Eric, > > On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote: > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > > > using CPU the most more than before, and mem_cgroup APIs are also > > > > called in this function. It means the mem cgroup must be enabled in > > > > the test env, which may explain why I couldn't reproduce it. > > > > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > > > reclaim the memory, which is *more frequent* to call > > > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > > > SK_RECLAIM_THRESHOLD). It might be cheap when > > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > > > cheap when mem_cgroup_sockets_enabled is true. > > > > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > > > for SCTP netperf testing, which is more sensitive to the extra > > > > function calls than TCP. > > > > > > > > Can we re-run this testing without mem cgroup enabled? > > > > > > FWIW I defer to Eric, thanks a lot for double checking the report > > > and digging in! 
> > > > I did tests with TCP + memcg and noticed a very small additional cost > > in memcg functions, > > because of suboptimal layout: > > > > Extract of an internal Google bug, update from June 9th: > > > > -------------------------------- > > I have noticed a minor false sharing to fetch (struct > > mem_cgroup)->css.parent, at offset 0xc0, > > because it shares the cache line containing struct mem_cgroup.memory, > > at offset 0xd0 > > > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read > > mostly cache line. > > ----------------------- > > > > But nothing that could explain a "-69.4% regression" > > We can double check that. > > > memcg has a very similar strategy of per-cpu reserves, with > > MEMCG_CHARGE_BATCH being 32 pages per cpu. > > We have proposed patch to increase the batch numer for stats > update, which was not accepted as it hurts the accuracy and > the data is used by many tools. > > > It is not clear why SCTP with 10K writes would overflow this reserve constantly. > > > > Presumably memcg experts will have to rework structure alignments to > > make sure they can cope better > > with more charge/uncharge operations, because we are not going back to > > gigantic per-socket reserves, > > this simply does not scale. > > Yes, the memcg statitics and charge/unchage update is very sensitive > with the data alignemnt layout, and can easily trigger peformance > changes, as we've seen quite some similar cases in the past several > years. > > One pattern we've seen is, even if a memcg stats updating or charge > function only takes about 2%~3% of the CPU cycles in perf-profile data, > once it got affected, the peformance change could be amplified to up to > 60% or more. > Reorganizing "struct mem_cgroup" to put "struct page_counter memory" in a separate cache line would be beneficial. Many low hanging fruits, assuming nobody will use __randomize_layout on it ;) Also some fields are written even if their value is not changed. 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
 	struct page_counter *fail;
 
 	if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
-		memcg->tcpmem_pressure = 0;
+		if (READ_ONCE(memcg->tcpmem_pressure))
+			WRITE_ONCE(memcg->tcpmem_pressure, 0);
 		return true;
 	}
-	memcg->tcpmem_pressure = 1;
+	if (!READ_ONCE(memcg->tcpmem_pressure))
+		WRITE_ONCE(memcg->tcpmem_pressure, 1);
 	if (gfp_mask & __GFP_NOFAIL) {
 		page_counter_charge(&memcg->tcpmem, nr_pages);
 		return true;

^ permalink raw reply	[flat|nested] 32+ messages in thread
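The idea behind the patch above, reduced to a plain C sketch: probe the flag read-only first and store only on an actual transition, so the common no-change path does not dirty a cache line shared with other hot fields. The counter exists only for this demo, not in the kernel code:

```c
#include <assert.h>

/* Even a store of an unchanged value dirties the cache line and makes
 * other CPUs sharing it re-fetch; probing first keeps the line in the
 * Shared state on the common, no-transition path. */
static unsigned long stores_avoided;

static void set_pressure(int *flag, int val)
{
	if (*flag == val) {
		stores_avoided++;	/* counted here only for the demo */
		return;
	}
	*flag = val;
}
```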
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 5:45 ` Eric Dumazet @ 2022-06-24 6:00 ` Feng Tang 2022-06-24 6:07 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-06-24 6:00 UTC (permalink / raw) To: Eric Dumazet Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML, Linux Memory Management List, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin, Ying Xu On Fri, Jun 24, 2022 at 07:45:00AM +0200, Eric Dumazet wrote: > On Fri, Jun 24, 2022 at 7:14 AM Feng Tang <feng.tang@intel.com> wrote: > > > > Hi Eric, > > > > On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote: > > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > > > > using CPU the most more than before, and mem_cgroup APIs are also > > > > > called in this function. It means the mem cgroup must be enabled in > > > > > the test env, which may explain why I couldn't reproduce it. > > > > > > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > > > > reclaim the memory, which is *more frequent* to call > > > > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > > > > SK_RECLAIM_THRESHOLD). It might be cheap when > > > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > > > > cheap when mem_cgroup_sockets_enabled is true. > > > > > > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > > > > for SCTP netperf testing, which is more sensitive to the extra > > > > > function calls than TCP. 
> > > > > > > > > > Can we re-run this testing without mem cgroup enabled? > > > > > > > > FWIW I defer to Eric, thanks a lot for double checking the report > > > > and digging in! > > > > > > I did tests with TCP + memcg and noticed a very small additional cost > > > in memcg functions, > > > because of suboptimal layout: > > > > > > Extract of an internal Google bug, update from June 9th: > > > > > > -------------------------------- > > > I have noticed a minor false sharing to fetch (struct > > > mem_cgroup)->css.parent, at offset 0xc0, > > > because it shares the cache line containing struct mem_cgroup.memory, > > > at offset 0xd0 > > > > > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read > > > mostly cache line. > > > ----------------------- > > > > > > But nothing that could explain a "-69.4% regression" > > > > We can double check that. > > > > > memcg has a very similar strategy of per-cpu reserves, with > > > MEMCG_CHARGE_BATCH being 32 pages per cpu. > > > > We have proposed patch to increase the batch numer for stats > > update, which was not accepted as it hurts the accuracy and > > the data is used by many tools. > > > > > It is not clear why SCTP with 10K writes would overflow this reserve constantly. > > > > > > Presumably memcg experts will have to rework structure alignments to > > > make sure they can cope better > > > with more charge/uncharge operations, because we are not going back to > > > gigantic per-socket reserves, > > > this simply does not scale. > > > > Yes, the memcg statitics and charge/unchage update is very sensitive > > with the data alignemnt layout, and can easily trigger peformance > > changes, as we've seen quite some similar cases in the past several > > years. > > > > One pattern we've seen is, even if a memcg stats updating or charge > > function only takes about 2%~3% of the CPU cycles in perf-profile data, > > once it got affected, the peformance change could be amplified to up to > > 60% or more. 
> >
>
> Reorganizing "struct mem_cgroup" to put "struct page_counter memory"
> in a separate cache line would be beneficial.

That may help.

And I also want to say that the benchmarks (especially the micro ones)
are very sensitive to the layout of mem_cgroup. As 'page_counter' is
112 bytes in size, I recently made a patch to make it cacheline
aligned (taking two cache lines), which improved some
hackbench/netperf test cases but caused a huge (49%) drop in some
vm-scalability tests.

> Many low hanging fruits, assuming nobody will use __randomize_layout on it ;)
>
> Also some fields are written even if their value is not changed.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74
> 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup
> *memcg, unsigned int nr_pages,
>         struct page_counter *fail;
>
>         if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> -               memcg->tcpmem_pressure = 0;
> +               if (READ_ONCE(memcg->tcpmem_pressure))
> +                       WRITE_ONCE(memcg->tcpmem_pressure, 0);
>                 return true;
>         }
> -       memcg->tcpmem_pressure = 1;
> +       if (!READ_ONCE(memcg->tcpmem_pressure))
> +               WRITE_ONCE(memcg->tcpmem_pressure, 1);
>         if (gfp_mask & __GFP_NOFAIL) {
>                 page_counter_charge(&memcg->tcpmem, nr_pages);
>                 return true;

I will also try this patch, which may take some time.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread
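The layout split being discussed here can be demonstrated with a toy struct. The field names are made up for illustration; the real struct mem_cgroup layout differs, and `aligned` is a GCC/Clang attribute:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Toy stand-in: force the read-mostly fields onto their own cache line
 * so they stop sharing one with the hot, frequently-written counter. */
struct toy_memcg {
	long usage;		/* hot: updated on every charge/uncharge */
	long watermark;
	/* read-mostly fields, kept off the hot line: */
	struct {
		void *parent;
		long  socket_pressure;
	} ro __attribute__((aligned(CACHE_LINE)));
};
```

The cost Feng mentions is the flip side: the alignment padding grows the struct, which can hurt workloads that iterate over many of them.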
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 6:00 ` Feng Tang @ 2022-06-24 6:07 ` Eric Dumazet 0 siblings, 0 replies; 32+ messages in thread From: Eric Dumazet @ 2022-06-24 6:07 UTC (permalink / raw) To: Feng Tang Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML, Linux Memory Management List, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin, Ying Xu On Fri, Jun 24, 2022 at 8:01 AM Feng Tang <feng.tang@intel.com> wrote: > > On Fri, Jun 24, 2022 at 07:45:00AM +0200, Eric Dumazet wrote: > > On Fri, Jun 24, 2022 at 7:14 AM Feng Tang <feng.tang@intel.com> wrote: > > > > > > Hi Eric, > > > > > > On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote: > > > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > > > > > using CPU the most more than before, and mem_cgroup APIs are also > > > > > > called in this function. It means the mem cgroup must be enabled in > > > > > > the test env, which may explain why I couldn't reproduce it. > > > > > > > > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > > > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > > > > > reclaim the memory, which is *more frequent* to call > > > > > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > > > > > SK_RECLAIM_THRESHOLD). It might be cheap when > > > > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > > > > > cheap when mem_cgroup_sockets_enabled is true. 
> > > > > > > > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > > > > > for SCTP netperf testing, which is more sensitive to the extra > > > > > > function calls than TCP. > > > > > > > > > > > > Can we re-run this testing without mem cgroup enabled? > > > > > > > > > > FWIW I defer to Eric, thanks a lot for double checking the report > > > > > and digging in! > > > > > > > > I did tests with TCP + memcg and noticed a very small additional cost > > > > in memcg functions, > > > > because of suboptimal layout: > > > > > > > > Extract of an internal Google bug, update from June 9th: > > > > > > > > -------------------------------- > > > > I have noticed a minor false sharing to fetch (struct > > > > mem_cgroup)->css.parent, at offset 0xc0, > > > > because it shares the cache line containing struct mem_cgroup.memory, > > > > at offset 0xd0 > > > > > > > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read > > > > mostly cache line. > > > > ----------------------- > > > > > > > > But nothing that could explain a "-69.4% regression" > > > > > > We can double check that. > > > > > > > memcg has a very similar strategy of per-cpu reserves, with > > > > MEMCG_CHARGE_BATCH being 32 pages per cpu. > > > > > > We have proposed patch to increase the batch numer for stats > > > update, which was not accepted as it hurts the accuracy and > > > the data is used by many tools. > > > > > > > It is not clear why SCTP with 10K writes would overflow this reserve constantly. > > > > > > > > Presumably memcg experts will have to rework structure alignments to > > > > make sure they can cope better > > > > with more charge/uncharge operations, because we are not going back to > > > > gigantic per-socket reserves, > > > > this simply does not scale. 
> > > > > > Yes, the memcg statitics and charge/unchage update is very sensitive > > > with the data alignemnt layout, and can easily trigger peformance > > > changes, as we've seen quite some similar cases in the past several > > > years. > > > > > > One pattern we've seen is, even if a memcg stats updating or charge > > > function only takes about 2%~3% of the CPU cycles in perf-profile data, > > > once it got affected, the peformance change could be amplified to up to > > > 60% or more. > > > > > > > Reorganizing "struct mem_cgroup" to put "struct page_counter memory" > > in a separate cache line would be beneficial. > > That may help. > > And I also want to say the benchmarks(especially micro one) are very > sensitive to the layout of mem_cgroup. As the 'page_counter' is 112 > bytes in size, I recently made a patch to make it cacheline aligned > (take 2 cachelines), which improved some hackbench/netperf test > cases, but caused huge (49%) drop for some vm-scalability tests. > > > Many low hanging fruits, assuming nobody will use __randomize_layout on it ;) > > > > Also some fields are written even if their value is not changed. 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74 > > 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup > > *memcg, unsigned int nr_pages, > > struct page_counter *fail; > > > > if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) { > > - memcg->tcpmem_pressure = 0; > > + if (READ_ONCE(memcg->tcpmem_pressure)) > > + WRITE_ONCE(memcg->tcpmem_pressure, 0); > > return true; > > } > > - memcg->tcpmem_pressure = 1; > > + if (!READ_ONCE(memcg->tcpmem_pressure)) > > + WRITE_ONCE(memcg->tcpmem_pressure, 1); > > if (gfp_mask & __GFP_NOFAIL) { > > page_counter_charge(&memcg->tcpmem, nr_pages); > > return true; > > I will also try this patch, which may take some time. Note that applications can opt-in reserving memory for one socket, using SO_RESERVE_MEM This can be used for jobs with a controlled number of sockets, as this will avoid many charge/uncharge operations. ^ permalink raw reply [flat|nested] 32+ messages in thread
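For reference, SO_RESERVE_MEM is set with a plain setsockopt(). A minimal sketch follows; the fallback value 73 is what recent <asm-generic/socket.h> headers define, but verify it against your headers, and expect the call to fail on kernels without support or for sockets not in a suitable memcg:

```c
#include <sys/socket.h>

#ifndef SO_RESERVE_MEM
#define SO_RESERVE_MEM 73	/* from <asm-generic/socket.h>; verify locally */
#endif

/* Pre-charge 'bytes' of forward-alloc memory to this socket so the data
 * path can skip repeated memcg charge/uncharge calls. Returns 0 on
 * success, -1 (with errno set) on failure. */
static int reserve_sock_mem(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
			  &bytes, sizeof(bytes));
}
```

This fits the "controlled number of sockets" case Eric describes: one reservation up front instead of per-packet accounting traffic.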
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 4:13 ` Eric Dumazet 2022-06-24 4:22 ` Eric Dumazet 2022-06-24 5:13 ` Feng Tang @ 2022-06-24 6:34 ` Shakeel Butt 2022-06-24 7:06 ` Feng Tang 2 siblings, 1 reply; 32+ messages in thread From: Shakeel Butt @ 2022-06-24 6:34 UTC (permalink / raw) To: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Tang, Feng, Xing Zhengjun, Yin Fengwei, Ying Xu CCing memcg folks. The thread starts at https://lore.kernel.org/all/20220619150456.GB34471@xsang-OptiPlex-9020/ On Thu, Jun 23, 2022 at 9:14 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > > using CPU the most more than before, and mem_cgroup APIs are also > > > called in this function. It means the mem cgroup must be enabled in > > > the test env, which may explain why I couldn't reproduce it. > > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > > reclaim the memory, which is *more frequent* to call > > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > > SK_RECLAIM_THRESHOLD). It might be cheap when > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > > cheap when mem_cgroup_sockets_enabled is true. > > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > > for SCTP netperf testing, which is more sensitive to the extra > > > function calls than TCP. 
> > > > > > Can we re-run this testing without mem cgroup enabled? > > > > FWIW I defer to Eric, thanks a lot for double checking the report > > and digging in! > > I did tests with TCP + memcg and noticed a very small additional cost > in memcg functions, > because of suboptimal layout: > > Extract of an internal Google bug, update from June 9th: > > -------------------------------- > I have noticed a minor false sharing to fetch (struct > mem_cgroup)->css.parent, at offset 0xc0, > because it shares the cache line containing struct mem_cgroup.memory, > at offset 0xd0 > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read > mostly cache line. > ----------------------- > > But nothing that could explain a "-69.4% regression" > > memcg has a very similar strategy of per-cpu reserves, with > MEMCG_CHARGE_BATCH being 32 pages per cpu. > > It is not clear why SCTP with 10K writes would overflow this reserve constantly. > > Presumably memcg experts will have to rework structure alignments to > make sure they can cope better > with more charge/uncharge operations, because we are not going back to > gigantic per-socket reserves, > this simply does not scale. Yes, I agree. As you pointed out, there are fields which are mostly read-only but share cache lines with fields which get updated, and that definitely needs work. However, can we first confirm whether memcg charging is really the issue here? As I remember, these Intel lkp tests are configured to run in the root memcg, and the kernel does not associate the root memcg with any socket (see mem_cgroup_sk_alloc()). If these tests are running in a non-root memcg, is this cgroup v1 or v2? The memory counter and the 32-pages-per-cpu stock are only used on v2. For v1, there is no per-cpu stock and there is a separate tcpmem page counter, and on v1 network memory accounting has to be enabled explicitly, i.e. it is not enabled by default. 
There is a definite possibility of a slowdown on v1, but let's first confirm the memcg setup used for this testing environment. Feng, can you please explain the memcg setup on these test machines and whether the tests are run in a root or non-root memcg? thanks, Shakeel ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 6:34 ` Shakeel Butt @ 2022-06-24 7:06 ` Feng Tang 2022-06-24 14:43 ` Shakeel Butt 0 siblings, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-06-24 7:06 UTC (permalink / raw) To: Shakeel Butt Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote: > CCing memcg folks. > > The thread starts at > https://lore.kernel.org/all/20220619150456.GB34471@xsang-OptiPlex-9020/ > > On Thu, Jun 23, 2022 at 9:14 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote: > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one > > > > using CPU the most more than before, and mem_cgroup APIs are also > > > > called in this function. It means the mem cgroup must be enabled in > > > > the test env, which may explain why I couldn't reproduce it. > > > > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as > > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to > > > > reclaim the memory, which is *more frequent* to call > > > > __sk_mem_reduce_allocated() than before (checking reclaimable >= > > > > SK_RECLAIM_THRESHOLD). It might be cheap when > > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still > > > > cheap when mem_cgroup_sockets_enabled is true. > > > > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck > > > > for SCTP netperf testing, which is more sensitive to the extra > > > > function calls than TCP. 
> > > > > > > > Can we re-run this testing without mem cgroup enabled? > > > > > > FWIW I defer to Eric, thanks a lot for double checking the report > > > and digging in! > > > > I did tests with TCP + memcg and noticed a very small additional cost > > in memcg functions, > > because of suboptimal layout: > > > > Extract of an internal Google bug, update from June 9th: > > > > -------------------------------- > > I have noticed a minor false sharing to fetch (struct > > mem_cgroup)->css.parent, at offset 0xc0, > > because it shares the cache line containing struct mem_cgroup.memory, > > at offset 0xd0 > > > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read > > mostly cache line. > > ----------------------- > > > > But nothing that could explain a "-69.4% regression" > > > > memcg has a very similar strategy of per-cpu reserves, with > > MEMCG_CHARGE_BATCH being 32 pages per cpu. > > > > It is not clear why SCTP with 10K writes would overflow this reserve constantly. > > > > Presumably memcg experts will have to rework structure alignments to > > make sure they can cope better > > with more charge/uncharge operations, because we are not going back to > > gigantic per-socket reserves, > > this simply does not scale. > > Yes I agree. As you pointed out there are fields which are mostly > read-only but sharing cache lines with fields which get updated and > definitely need work. > > However can we first confirm if memcg charging is really the issue > here as I remember these intel lkp tests are configured to run in root > memcg and the kernel does not associate root memcg to any socket (see > mem_cgroup_sk_alloc()). > > If these tests are running in non-root memcg, is this cgroup v1 or v2? > The memory counter and the 32 pages per cpu stock are only used on v2. > For v1, there is no per-cpu stock and there is a separate tcpmem page > counter and on v1 the network memory accounting has to be enabled > explicitly i.e. not enabled by default. 
> There is definite possibility of slowdown on v1 but let's first confirm the memcg setup used for this testing environment. Feng, can you please explain the memcg setup on these test machines and if the tests are run in root or non-root memcg? I don't know the exact setup, Philip/Oliver from 0Day can correct me. I logged into a test box which runs the netperf test, and it seems to be cgroup v1 and a non-root memcg. The netperf tasks all sit in dir: '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service' And the rootfs is a Debian-based rootfs Thanks, Feng > thanks, > Shakeel ^ permalink raw reply [flat|nested] 32+ messages in thread
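For reference, a quick way to tell which cgroup hierarchy a box like this is on; a sketch, assuming the conventional /sys/fs/cgroup mount point:

```shell
# Detect which cgroup hierarchy the memory controller is on.
# cgroup v2 (unified) exposes cgroup.controllers at the mount root;
# v1 mounts a separate 'memory' hierarchy.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroup v2 (unified hierarchy)"
elif [ -d /sys/fs/cgroup/memory ]; then
    echo "cgroup v1 (or hybrid; separate memory hierarchy)"
else
    echo "no cgroup memory controller mounted at /sys/fs/cgroup"
fi
```

On a hybrid (mixed v1/v2) setup the second branch fires; the controller actually accounting socket memory is the one the test job's cgroup lives in.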
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 7:06 ` Feng Tang @ 2022-06-24 14:43 ` Shakeel Butt 2022-06-25 2:36 ` Feng Tang 0 siblings, 1 reply; 32+ messages in thread From: Shakeel Butt @ 2022-06-24 14:43 UTC (permalink / raw) To: Feng Tang Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote: > On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote: [...] > > > > Feng, can you please explain the memcg setup on these test machines > > and if the tests are run in root or non-root memcg? > > I don't know the exact setup, Philip/Oliver from 0Day can correct me. > > I logged into a test box which runs netperf test, and it seems to be > cgoup v1 and non-root memcg. The netperf tasks all sit in dir: > '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service' > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making sure that the netperf test has already run? If this is non-zero then network memory accounting is enabled and the slowdown is expected. > And the rootfs is a debian based rootfs > > Thanks, > Feng > > > > thanks, > > Shakeel ^ permalink raw reply [flat|nested] 32+ messages in thread
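That check can be scripted; a sketch, assuming the cgroup path from the test box above (the counter file only exists on a cgroup v1 memory hierarchy):

```shell
# Read the cgroup v1 TCP-accounting high-watermark, if present.
# Path taken from the lkp test box discussed above; adjust for other setups.
cg=/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service
f="$cg/memory.kmem.tcp.max_usage_in_bytes"
if [ -f "$f" ]; then
    echo "tcp max usage: $(cat "$f") bytes"   # non-zero means accounting is active
else
    echo "counter not present (not cgroup v1, or different path)"
fi
```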
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-24 14:43 ` Shakeel Butt @ 2022-06-25 2:36 ` Feng Tang 2022-06-27 2:38 ` Feng Tang 0 siblings, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-06-25 2:36 UTC (permalink / raw) To: Shakeel Butt, Eric Dumazet Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Fri, Jun 24, 2022 at 02:43:58PM +0000, Shakeel Butt wrote: > On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote: > > On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote: > [...] > > > > > > Feng, can you please explain the memcg setup on these test machines > > > and if the tests are run in root or non-root memcg? > > > > I don't know the exact setup, Philip/Oliver from 0Day can correct me. > > > > I logged into a test box which runs netperf test, and it seems to be > > cgoup v1 and non-root memcg. The netperf tasks all sit in dir: > > '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service' > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making > sure that the netperf test has already run? memory.kmem.tcp.max_usage_in_bytes:0 And here is more memcg stats (let me know if you want to check more) /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service# grep . 
memory.* memory.failcnt:0 memory.kmem.failcnt:0 memory.kmem.limit_in_bytes:9223372036854771712 memory.kmem.max_usage_in_bytes:47861760 memory.kmem.tcp.failcnt:0 memory.kmem.tcp.limit_in_bytes:9223372036854771712 memory.kmem.tcp.max_usage_in_bytes:0 memory.kmem.tcp.usage_in_bytes:0 memory.kmem.usage_in_bytes:40730624 memory.limit_in_bytes:9223372036854771712 memory.max_usage_in_bytes:642424832 memory.memsw.failcnt:0 memory.memsw.limit_in_bytes:9223372036854771712 memory.memsw.max_usage_in_bytes:642424832 memory.memsw.usage_in_bytes:639549440 memory.move_charge_at_immigrate:0 memory.numa_stat:total=144073 N0=124819 N1=19254 memory.numa_stat:file=0 N0=0 N1=0 memory.numa_stat:anon=77721 N0=58502 N1=19219 memory.numa_stat:unevictable=66352 N0=66317 N1=35 memory.numa_stat:hierarchical_total=144073 N0=124819 N1=19254 memory.numa_stat:hierarchical_file=0 N0=0 N1=0 memory.numa_stat:hierarchical_anon=77721 N0=58502 N1=19219 memory.numa_stat:hierarchical_unevictable=66352 N0=66317 N1=35 memory.oom_control:oom_kill_disable 0 memory.oom_control:under_oom 0 memory.oom_control:oom_kill 0 grep: memory.pressure_level: Invalid argument memory.soft_limit_in_bytes:9223372036854771712 memory.stat:cache 282562560 memory.stat:rss 307884032 memory.stat:rss_huge 239075328 memory.stat:shmem 10784768 memory.stat:mapped_file 3444736 memory.stat:dirty 0 memory.stat:writeback 0 memory.stat:swap 0 memory.stat:pgpgin 1018918 memory.stat:pgpgout 932902 memory.stat:pgfault 2130513 memory.stat:pgmajfault 0 memory.stat:inactive_anon 310272000 memory.stat:active_anon 8073216 memory.stat:inactive_file 0 memory.stat:active_file 0 memory.stat:unevictable 271777792 memory.stat:hierarchical_memory_limit 9223372036854771712 memory.stat:hierarchical_memsw_limit 9223372036854771712 memory.stat:total_cache 282562560 memory.stat:total_rss 307884032 memory.stat:total_rss_huge 239075328 memory.stat:total_shmem 10784768 memory.stat:total_mapped_file 3444736 memory.stat:total_dirty 0 memory.stat:total_writeback 0 
memory.stat:total_swap 0 memory.stat:total_pgpgin 1018918 memory.stat:total_pgpgout 932902 memory.stat:total_pgfault 2130513 memory.stat:total_pgmajfault 0 memory.stat:total_inactive_anon 310272000 memory.stat:total_active_anon 8073216 memory.stat:total_inactive_file 0 memory.stat:total_active_file 0 memory.stat:total_unevictable 271777792 memory.swappiness:60 memory.usage_in_bytes:639549440 memory.use_hierarchy:1 > If this is non-zero then network memory accounting is enabled and the > slowdown is expected. From the perf-profile data in the original report, both __sk_mem_raise_allocated() and __sk_mem_reduce_allocated() are called much more often, and both call the memcg charge/uncharge functions. IIUC, the call chain is: __sk_mem_raise_allocated sk_memory_allocated_add mem_cgroup_charge_skmem charge memcg->tcpmem (for cgroup v1) try_charge memcg (for v2) Also, from one of Eric's earlier commit logs: " net: implement per-cpu reserves for memory_allocated ... This means we are going to call sk_memory_allocated_add() and sk_memory_allocated_sub() more often. ... " So is this slowdown related to the more frequent calling of charge/uncharge? Thanks, Feng > > And the rootfs is a debian based rootfs > > > > Thanks, > > Feng > > > > > > > thanks, > > > Shakeel ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-25 2:36 ` Feng Tang @ 2022-06-27 2:38 ` Feng Tang 2022-06-27 8:46 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-06-27 2:38 UTC (permalink / raw) To: Shakeel Butt, Eric Dumazet Cc: Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Sat, Jun 25, 2022 at 10:36:42AM +0800, Feng Tang wrote: > On Fri, Jun 24, 2022 at 02:43:58PM +0000, Shakeel Butt wrote: > > On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote: > > > On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote: > > [...] > > > > > > > > Feng, can you please explain the memcg setup on these test machines > > > > and if the tests are run in root or non-root memcg? > > > > > > I don't know the exact setup, Philip/Oliver from 0Day can correct me. > > > > > > I logged into a test box which runs netperf test, and it seems to be > > > cgoup v1 and non-root memcg. The netperf tasks all sit in dir: > > > '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service' > > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making > > sure that the netperf test has already run? > > memory.kmem.tcp.max_usage_in_bytes:0 Sorry, I made a mistake: in the original report from Oliver, it was 'cgroup v2' with a 'debian-11.1' rootfs. When you asked about cgroup info, I tried the job on another tbox, and the original 'job.yaml' didn't work, so I kept the 'netperf' test parameters and started a new job which somehow ran with a 'debian-10.4' rootfs and actually ran with cgroup v1. 
And as you mentioned, the cgroup version does make a big difference: with v1, the regression is reduced to 1% ~ 5% on different generations of test platforms. Eric mentioned they also got a regression report, but a much smaller one; maybe that's due to the cgroup version? Thanks, Feng > And here is more memcg stats (let me know if you want to check more) > > If this is non-zero then network memory accounting is enabled and the > slowdown is expected. > > >From the perf-profile data in original report, both > __sk_mem_raise_allocated() and __sk_mem_reduce_allocated() are called > much more often, which call memcg charge/uncharge functions. > > IIUC, the call chain is: > > __sk_mem_raise_allocated > sk_memory_allocated_add > mem_cgroup_charge_skmem > charge memcg->tcpmem (for cgroup v2) > try_charge memcg (for v1) > > Also from Eric's one earlier commit log: > > " > net: implement per-cpu reserves for memory_allocated > ... > This means we are going to call sk_memory_allocated_add() > and sk_memory_allocated_sub() more often. > ... > " > > So this slowdown is related to the more calling of charge/uncharge? > > Thanks, > Feng > > > > And the rootfs is a debian based rootfs > > > > > > Thanks, > > > Feng > > > > > > > > > > thanks, > > > > Shakeel ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 2:38 ` Feng Tang @ 2022-06-27 8:46 ` Eric Dumazet 2022-06-27 12:34 ` Feng Tang 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2022-06-27 8:46 UTC (permalink / raw) To: Feng Tang Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote: > > On Sat, Jun 25, 2022 at 10:36:42AM +0800, Feng Tang wrote: > > On Fri, Jun 24, 2022 at 02:43:58PM +0000, Shakeel Butt wrote: > > > On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote: > > > > On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote: > > > [...] > > > > > > > > > > Feng, can you please explain the memcg setup on these test machines > > > > > and if the tests are run in root or non-root memcg? > > > > > > > > I don't know the exact setup, Philip/Oliver from 0Day can correct me. > > > > > > > > I logged into a test box which runs netperf test, and it seems to be > > > > cgoup v1 and non-root memcg. The netperf tasks all sit in dir: > > > > '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service' > > > > > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making > > > sure that the netperf test has already run? > > > > memory.kmem.tcp.max_usage_in_bytes:0 > > Sorry, I made a mistake that in the original report from Oliver, it > was 'cgroup v2' with a 'debian-11.1' rootfs. 
> > When you asked about cgroup info, I tried the job on another tbox, and > the original 'job.yaml' didn't work, so I kept the 'netperf' test > parameters and started a new job which somehow run with a 'debian-10.4' > rootfs and acutally run with cgroup v1. > > And as you mentioned cgroup version does make a big difference, that > with v1, the regression is reduced to 1% ~ 5% on different generations > of test platforms. Eric mentioned they also got regression report, > but much smaller one, maybe it's due to the cgroup version? This was using the current net-next tree. Used recipe was something like: Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT. Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control. Create a cgroup by mkdir $MOUNT_POINT/job. Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs. <Launch tests> The regression was smaller than 1%, so considered noise compared to the benefits of the bug fix. > > Thanks, > Feng > > > And here is more memcg stats (let me know if you want to check more) > > > > > If this is non-zero then network memory accounting is enabled and the > > > slowdown is expected. > > > > >From the perf-profile data in original report, both > > __sk_mem_raise_allocated() and __sk_mem_reduce_allocated() are called > > much more often, which call memcg charge/uncharge functions. > > > > IIUC, the call chain is: > > > > __sk_mem_raise_allocated > > sk_memory_allocated_add > > mem_cgroup_charge_skmem > > charge memcg->tcpmem (for cgroup v2) > > try_charge memcg (for v1) > > > > Also from Eric's one earlier commit log: > > > > " > > net: implement per-cpu reserves for memory_allocated > > ... > > This means we are going to call sk_memory_allocated_add() > > and sk_memory_allocated_sub() more often. > > ... > > " > > > > So this slowdown is related to the more calling of charge/uncharge? 
> > > > Thanks, > > Feng > > > > > > And the rootfs is a debian based rootfs > > > > > > > > Thanks, > > > > Feng > > > > > > > > > > > > > thanks, > > > > > Shakeel ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 8:46 ` Eric Dumazet @ 2022-06-27 12:34 ` Feng Tang 2022-06-27 14:07 ` Eric Dumazet 2022-06-27 14:52 ` Shakeel Butt 0 siblings, 2 replies; 32+ messages in thread From: Feng Tang @ 2022-06-27 12:34 UTC (permalink / raw) To: Eric Dumazet Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote: > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote: [snip] > > > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making > > > > sure that the netperf test has already run? > > > > > > memory.kmem.tcp.max_usage_in_bytes:0 > > > > Sorry, I made a mistake that in the original report from Oliver, it > > was 'cgroup v2' with a 'debian-11.1' rootfs. > > > > When you asked about cgroup info, I tried the job on another tbox, and > > the original 'job.yaml' didn't work, so I kept the 'netperf' test > > parameters and started a new job which somehow run with a 'debian-10.4' > > rootfs and acutally run with cgroup v1. > > > > And as you mentioned cgroup version does make a big difference, that > > with v1, the regression is reduced to 1% ~ 5% on different generations > > of test platforms. Eric mentioned they also got regression report, > > but much smaller one, maybe it's due to the cgroup version? > > This was using the current net-next tree. > Used recipe was something like: > > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT. 
> Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control. > Create a cgroup by mkdir $MOUNT_POINT/job. > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs. > > <Launch tests> > > The regression was smaller than 1%, so considered noise compared to > the benefits of the bug fix. Yes, 1% is just around the noise level for a microbenchmark. I went to check the original test data of Oliver's report: the tests were run for 6 rounds and the performance data is pretty stable (0Day's report will show any std deviation bigger than 2%). The test platform is a 4-socket 72C/144T machine, and I ran the same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes) and one 2-socket Icelake platform, and saw 75% and 53% regressions on them. In the first email, there is a file named 'reproduce'; it shows the basic test process: " use 'performance' cpufreq governor for all CPUs netserver -4 -D modprobe sctp netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & (repeat 36 times in total) ... " This starts 36 (25% of nr_cpus) netperf clients. And the number of clients also matters: I tried increasing the client count from 36 to 72 (50%), and the regression changed from 69.4% to 73.7%. Thanks, Feng > > > > Thanks, > > Feng ^ permalink raw reply [flat|nested] 32+ messages in thread
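The repeated invocations in the 'reproduce' file amount to a small loop; a sketch, with the launch commands left commented out since they need netserver, the sctp module, and a long runtime:

```shell
# Client count used by the report: 25% of online CPUs (36 on 144 threads).
nr_clients=$(( $(nproc) * 25 / 100 ))
[ "$nr_clients" -ge 1 ] || nr_clients=1
echo "would launch $nr_clients SCTP_STREAM_MANY clients"
# Uncomment to actually run the test:
# netserver -4 -D
# modprobe sctp
# for i in $(seq "$nr_clients"); do
#     netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
# done
# wait
```

Scaling nr_clients to 50% of CPUs reproduces the second data point above (73.7%).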
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 12:34 ` Feng Tang @ 2022-06-27 14:07 ` Eric Dumazet 2022-06-27 14:48 ` Feng Tang 2022-06-27 14:52 ` Shakeel Butt 1 sibling, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2022-06-27 14:07 UTC (permalink / raw) To: Feng Tang Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 2:34 PM Feng Tang <feng.tang@intel.com> wrote: > > On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote: > > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote: > [snip] > > > > > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes > > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making > > > > > sure that the netperf test has already run? > > > > > > > > memory.kmem.tcp.max_usage_in_bytes:0 > > > > > > Sorry, I made a mistake that in the original report from Oliver, it > > > was 'cgroup v2' with a 'debian-11.1' rootfs. > > > > > > When you asked about cgroup info, I tried the job on another tbox, and > > > the original 'job.yaml' didn't work, so I kept the 'netperf' test > > > parameters and started a new job which somehow run with a 'debian-10.4' > > > rootfs and acutally run with cgroup v1. > > > > > > And as you mentioned cgroup version does make a big difference, that > > > with v1, the regression is reduced to 1% ~ 5% on different generations > > > of test platforms. Eric mentioned they also got regression report, > > > but much smaller one, maybe it's due to the cgroup version? > > > > This was using the current net-next tree. 
> > Used recipe was something like: > > > > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT. > > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control. > > Create a cgroup by mkdir $MOUNT_POINT/job. > > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs. > > > > <Launch tests> > > > > The regression was smaller than 1%, so considered noise compared to > > the benefits of the bug fix. > > Yes, 1% is just around noise level for a microbenchmark. > > I went check the original test data of Oliver's report, the tests was > run 6 rounds and the performance data is pretty stable (0Day's report > will show any std deviation bigger than 2%) > > The test platform is a 4 sockets 72C/144T machine, and I run the > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes) > and one Icelake 2 sockets platform, and saw 75% and 53% regresson on > them. > > In the first email, there is a file named 'reproduce', it shows the > basic test process: > > " > use 'performane' cpufre governor for all CPUs > > netserver -4 -D > modprobe sctp > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > (repeat 36 times in total) > ... > > " > > Which starts 36 (25% of nr_cpus) netperf clients. And the clients number > also matters, I tried to increase the client number from 36 to 72(50%), > and the regression is changed from 69.4% to 73.7%" > This seems like a lot of opportunities for memcg folks :) struct page_counter has poor field placement [1], and no per-cpu cache. [1] "atomic_long_t usage" is sharing cache line with read mostly fields. 
(struct mem_cgroup also has poor field placement, mainly because of struct page_counter) 28.69% [kernel] [k] copy_user_enhanced_fast_string 16.13% [kernel] [k] intel_idle_irq 6.46% [kernel] [k] page_counter_try_charge 6.20% [kernel] [k] __sk_mem_reduce_allocated 5.68% [kernel] [k] try_charge_memcg 5.16% [kernel] [k] page_counter_cancel > Thanks, > Feng > > > > > > > Thanks, > > > Feng ^ permalink raw reply [flat|nested] 32+ messages in thread
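The field-placement issues described above can be inspected with pahole on a vmlinux built with debug info; a sketch, where the pahole tool and the ./vmlinux path are assumptions about the local build environment:

```shell
# Show field offsets and cacheline boundaries of the structs in question.
# Requires pahole (from the dwarves package) and a vmlinux with DWARF info.
vmlinux=./vmlinux
if command -v pahole >/dev/null 2>&1 && [ -f "$vmlinux" ]; then
    pahole -C page_counter "$vmlinux"
    pahole -C mem_cgroup "$vmlinux" | head -60
else
    echo "pahole or $vmlinux not available; skipping layout dump"
fi
```

pahole prints per-field offsets and marks cacheline boundaries, which makes false sharing like "usage next to read-mostly fields" easy to spot.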
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 14:07 ` Eric Dumazet @ 2022-06-27 14:48 ` Feng Tang 2022-06-27 16:25 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-06-27 14:48 UTC (permalink / raw) To: Eric Dumazet Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 04:07:55PM +0200, Eric Dumazet wrote: > On Mon, Jun 27, 2022 at 2:34 PM Feng Tang <feng.tang@intel.com> wrote: > > > > On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote: > > > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote: > > [snip] > > > > > > > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes > > > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making > > > > > > sure that the netperf test has already run? > > > > > > > > > > memory.kmem.tcp.max_usage_in_bytes:0 > > > > > > > > Sorry, I made a mistake that in the original report from Oliver, it > > > > was 'cgroup v2' with a 'debian-11.1' rootfs. > > > > > > > > When you asked about cgroup info, I tried the job on another tbox, and > > > > the original 'job.yaml' didn't work, so I kept the 'netperf' test > > > > parameters and started a new job which somehow run with a 'debian-10.4' > > > > rootfs and acutally run with cgroup v1. > > > > > > > > And as you mentioned cgroup version does make a big difference, that > > > > with v1, the regression is reduced to 1% ~ 5% on different generations > > > > of test platforms. Eric mentioned they also got regression report, > > > > but much smaller one, maybe it's due to the cgroup version? 
> > > > > > This was using the current net-next tree. > > > Used recipe was something like: > > > > > > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT. > > > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control. > > > Create a cgroup by mkdir $MOUNT_POINT/job. > > > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs. > > > > > > <Launch tests> > > > > > > The regression was smaller than 1%, so considered noise compared to > > > the benefits of the bug fix. > > > > Yes, 1% is just around noise level for a microbenchmark. > > > > I went check the original test data of Oliver's report, the tests was > > run 6 rounds and the performance data is pretty stable (0Day's report > > will show any std deviation bigger than 2%) > > > > The test platform is a 4 sockets 72C/144T machine, and I run the > > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes) > > and one Icelake 2 sockets platform, and saw 75% and 53% regresson on > > them. > > > > In the first email, there is a file named 'reproduce', it shows the > > basic test process: > > > > " > > use 'performane' cpufre governor for all CPUs > > > > netserver -4 -D > > modprobe sctp > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > > (repeat 36 times in total) > > ... > > > > " > > > > Which starts 36 (25% of nr_cpus) netperf clients. And the clients number > > also matters, I tried to increase the client number from 36 to 72(50%), > > and the regression is changed from 69.4% to 73.7%" > > > > This seems like a lot of opportunities for memcg folks :) > > struct page_counter has poor field placement [1], and no per-cpu cache. > > [1] "atomic_long_t usage" is sharing cache line with read mostly fields. 
> > (struct mem_cgroup also has poor field placement, mainly because of > struct page_counter) > > 28.69% [kernel] [k] copy_user_enhanced_fast_string > 16.13% [kernel] [k] intel_idle_irq > 6.46% [kernel] [k] page_counter_try_charge > 6.20% [kernel] [k] __sk_mem_reduce_allocated > 5.68% [kernel] [k] try_charge_memcg > 5.16% [kernel] [k] page_counter_cancel Yes, I also analyzed the perf-profile data, and made some layout changes which could recover the changes from 69% to 40%. 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 ---------------- --------------------------- --------------------------- 15722 -69.5% 4792 -40.8% 9300 netperf.Throughput_Mbps diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 1bfcfb1af352..aa37bd39116c 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -179,14 +179,13 @@ struct cgroup_subsys_state { atomic_t online_cnt; /* percpu_ref killing and RCU release */ - struct work_struct destroy_work; struct rcu_work destroy_rwork; - + struct cgroup_subsys_state *parent; + struct work_struct destroy_work; /* * PI: the parent css. Placed here for cache proximity to following * fields of the containing structure. */ - struct cgroup_subsys_state *parent; }; /* diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9ecead1042b9..963b88ab9930 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -239,9 +239,6 @@ struct mem_cgroup { /* Private memcg ID. 
Used to ID objects that outlive the cgroup */ struct mem_cgroup_id id; - /* Accounted resources */ - struct page_counter memory; /* Both v1 & v2 */ - union { struct page_counter swap; /* v2 only */ struct page_counter memsw; /* v1 only */ @@ -251,6 +248,9 @@ struct mem_cgroup { struct page_counter kmem; /* v1 only */ struct page_counter tcpmem; /* v1 only */ + /* Accounted resources */ + struct page_counter memory; /* Both v1 & v2 */ + /* Range enforcement for interrupt charges */ struct work_struct high_work; @@ -313,7 +313,6 @@ struct mem_cgroup { atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS]; - unsigned long socket_pressure; /* Legacy tcp memory accounting */ bool tcpmem_active; @@ -349,6 +348,7 @@ struct mem_cgroup { #ifdef CONFIG_TRANSPARENT_HUGEPAGE struct deferred_split deferred_split_queue; #endif + unsigned long socket_pressure; struct mem_cgroup_per_node *nodeinfo[]; }; And some of these changes are specific to networking and may not be a universal win, though I think 'cgroup_subsys_state' could keep the read-mostly 'parent' away from the following written-mostly counters. Btw, I tried your debug patch, which failed to compile with 0Day's kbuild system, but it did compile ok on my local machine. Thanks, Feng > > > Thanks, > > Feng > > > > > > > > > > Thanks, > > > > Feng ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 14:48 ` Feng Tang @ 2022-06-27 16:25 ` Eric Dumazet 2022-06-27 16:48 ` Shakeel Butt 2022-06-28 3:49 ` Feng Tang 0 siblings, 2 replies; 32+ messages in thread From: Eric Dumazet @ 2022-06-27 16:25 UTC (permalink / raw) To: Feng Tang Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 4:48 PM Feng Tang <feng.tang@intel.com> wrote: > > On Mon, Jun 27, 2022 at 04:07:55PM +0200, Eric Dumazet wrote: > > On Mon, Jun 27, 2022 at 2:34 PM Feng Tang <feng.tang@intel.com> wrote: > > > > > > On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote: > > > > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote: > > > [snip] > > > > > > > > > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes > > > > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making > > > > > > > sure that the netperf test has already run? > > > > > > > > > > > > memory.kmem.tcp.max_usage_in_bytes:0 > > > > > > > > > > Sorry, I made a mistake that in the original report from Oliver, it > > > > > was 'cgroup v2' with a 'debian-11.1' rootfs. > > > > > > > > > > When you asked about cgroup info, I tried the job on another tbox, and > > > > > the original 'job.yaml' didn't work, so I kept the 'netperf' test > > > > > parameters and started a new job which somehow run with a 'debian-10.4' > > > > > rootfs and acutally run with cgroup v1. > > > > > > > > > > And as you mentioned cgroup version does make a big difference, that > > > > > with v1, the regression is reduced to 1% ~ 5% on different generations > > > > > of test platforms. 
Eric mentioned they also got regression report, > > > > > but much smaller one, maybe it's due to the cgroup version? > > > > > > > > This was using the current net-next tree. > > > > Used recipe was something like: > > > > > > > > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT. > > > > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control. > > > > Create a cgroup by mkdir $MOUNT_POINT/job. > > > > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs. > > > > > > > > <Launch tests> > > > > > > > > The regression was smaller than 1%, so considered noise compared to > > > > the benefits of the bug fix. > > > > > > Yes, 1% is just around noise level for a microbenchmark. > > > > > > I went check the original test data of Oliver's report, the tests was > > > run 6 rounds and the performance data is pretty stable (0Day's report > > > will show any std deviation bigger than 2%) > > > > > > The test platform is a 4 sockets 72C/144T machine, and I run the > > > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes) > > > and one Icelake 2 sockets platform, and saw 75% and 53% regresson on > > > them. > > > > > > In the first email, there is a file named 'reproduce', it shows the > > > basic test process: > > > > > > " > > > use 'performane' cpufre governor for all CPUs > > > > > > netserver -4 -D > > > modprobe sctp > > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K & > > > (repeat 36 times in total) > > > ... > > > > > > " > > > > > > Which starts 36 (25% of nr_cpus) netperf clients. 
And the clients number > > > also matters, I tried to increase the client number from 36 to 72(50%), > > > and the regression is changed from 69.4% to 73.7%" > > > > > > > This seems like a lot of opportunities for memcg folks :) > > > > struct page_counter has poor field placement [1], and no per-cpu cache. > > > > [1] "atomic_long_t usage" is sharing cache line with read mostly fields. > > > > (struct mem_cgroup also has poor field placement, mainly because of > > struct page_counter) > > > > 28.69% [kernel] [k] copy_user_enhanced_fast_string > > 16.13% [kernel] [k] intel_idle_irq > > 6.46% [kernel] [k] page_counter_try_charge > > 6.20% [kernel] [k] __sk_mem_reduce_allocated > > 5.68% [kernel] [k] try_charge_memcg > > 5.16% [kernel] [k] page_counter_cancel > > Yes, I also analyzed the perf-profile data, and made some layout changes > which could recover the changes from 69% to 40%. > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 > ---------------- --------------------------- --------------------------- > 15722 -69.5% 4792 -40.8% 9300 netperf.Throughput_Mbps > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 1bfcfb1af352..aa37bd39116c 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -179,14 +179,13 @@ struct cgroup_subsys_state { > atomic_t online_cnt; > > /* percpu_ref killing and RCU release */ > - struct work_struct destroy_work; > struct rcu_work destroy_rwork; > - > + struct cgroup_subsys_state *parent; > + struct work_struct destroy_work; > /* > * PI: the parent css. Placed here for cache proximity to following > * fields of the containing structure. > */ > - struct cgroup_subsys_state *parent; > }; > > /* > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 9ecead1042b9..963b88ab9930 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -239,9 +239,6 @@ struct mem_cgroup { > /* Private memcg ID. 
Used to ID objects that outlive the cgroup */ > struct mem_cgroup_id id; > > - /* Accounted resources */ > - struct page_counter memory; /* Both v1 & v2 */ > - > union { > struct page_counter swap; /* v2 only */ > struct page_counter memsw; /* v1 only */ > @@ -251,6 +248,9 @@ struct mem_cgroup { > struct page_counter kmem; /* v1 only */ > struct page_counter tcpmem; /* v1 only */ > > + /* Accounted resources */ > + struct page_counter memory; /* Both v1 & v2 */ > + > /* Range enforcement for interrupt charges */ > struct work_struct high_work; > > @@ -313,7 +313,6 @@ struct mem_cgroup { > atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; > atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS]; > > - unsigned long socket_pressure; > > /* Legacy tcp memory accounting */ > bool tcpmem_active; > @@ -349,6 +348,7 @@ struct mem_cgroup { > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > struct deferred_split deferred_split_queue; > #endif > + unsigned long socket_pressure; > > struct mem_cgroup_per_node *nodeinfo[]; > }; > I simply did the following and got much better results. But I am not sure if updates to ->usage are really needed that often... diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -3,12 +3,15 @@ #define _LINUX_PAGE_COUNTER_H #include <linux/atomic.h> +#include <linux/cache.h> #include <linux/kernel.h> #include <asm/page.h> struct page_counter { - atomic_long_t usage; - unsigned long min; + /* contended cache line. 
*/ + atomic_long_t usage ____cacheline_aligned_in_smp; + + unsigned long min ____cacheline_aligned_in_smp; unsigned long low; unsigned long high; unsigned long max; @@ -27,12 +30,6 @@ struct page_counter { unsigned long watermark; unsigned long failcnt; - /* - * 'parent' is placed here to be far from 'usage' to reduce - * cache false sharing, as 'usage' is written mostly while - * parent is frequently read for cgroup's hierarchical - * counting nature. - */ struct page_counter *parent; }; > And some of these are specific for network and may not be a universal > win, though I think the 'cgroup_subsys_state' could keep the > read-mostly 'parent' away from following written-mostly counters. > > Btw, I tried your debug patch which compiled fail with 0Day's kbuild > system, but it did compile ok on my local machine. > > Thanks, > Feng > > > > > > Thanks, > > > Feng > > > > > > > > > > > > > Thanks, > > > > > Feng ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 16:25 ` Eric Dumazet @ 2022-06-27 16:48 ` Shakeel Butt 2022-06-27 17:05 ` Eric Dumazet 2022-06-28 1:46 ` Roman Gushchin 2022-06-28 3:49 ` Feng Tang 1 sibling, 2 replies; 32+ messages in thread From: Shakeel Butt @ 2022-06-27 16:48 UTC (permalink / raw) To: Eric Dumazet Cc: Feng Tang, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 9:26 AM Eric Dumazet <edumazet@google.com> wrote: > [...] > > > > I simply did the following and got much better results. > > But I am not sure if updates to ->usage are really needed that often... I suspect we need to improve the per-cpu memcg stock usage here. Were the updates mostly from uncharge path or charge path or that's irrelevant? I think doing full drain (i.e. drain_stock()) within __refill_stock() when the local cache is larger than MEMCG_CHARGE_BATCH is not best. Rather we should always keep at least MEMCG_CHARGE_BATCH for such scenarios. > > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36 > 100644 > --- a/include/linux/page_counter.h > +++ b/include/linux/page_counter.h > @@ -3,12 +3,15 @@ > #define _LINUX_PAGE_COUNTER_H > > #include <linux/atomic.h> > +#include <linux/cache.h> > #include <linux/kernel.h> > #include <asm/page.h> > > struct page_counter { > - atomic_long_t usage; > - unsigned long min; > + /* contended cache line. */ > + atomic_long_t usage ____cacheline_aligned_in_smp; > + > + unsigned long min ____cacheline_aligned_in_smp; Do we need to align 'min' too? 
> unsigned long low; > unsigned long high; > unsigned long max; > @@ -27,12 +30,6 @@ struct page_counter { > unsigned long watermark; > unsigned long failcnt; > > - /* > - * 'parent' is placed here to be far from 'usage' to reduce > - * cache false sharing, as 'usage' is written mostly while > - * parent is frequently read for cgroup's hierarchical > - * counting nature. > - */ > struct page_counter *parent; > }; > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 16:48 ` Shakeel Butt @ 2022-06-27 17:05 ` Eric Dumazet 2022-06-28 1:46 ` Roman Gushchin 1 sibling, 0 replies; 32+ messages in thread From: Eric Dumazet @ 2022-06-27 17:05 UTC (permalink / raw) To: Shakeel Butt Cc: Feng Tang, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 6:48 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Mon, Jun 27, 2022 at 9:26 AM Eric Dumazet <edumazet@google.com> wrote: > > > [...] > > > > > > > I simply did the following and got much better results. > > > > But I am not sure if updates to ->usage are really needed that often... > > I suspect we need to improve the per-cpu memcg stock usage here. Were > the updates mostly from uncharge path or charge path or that's > irrelevant? I wonder if the cache is always used... stock = this_cpu_ptr(&memcg_stock); if (memcg == stock->cached && stock->nr_pages >= nr_pages) { Apparently the per-cpu cache is only used for one memcg at a time ? Not sure how this would scale to hosts with dozens of memcgs. Maybe we could add some metrics to have an idea of the cache hit/miss ratio :/ > > I think doing full drain (i.e. drain_stock()) within __refill_stock() > when the local cache is larger than MEMCG_CHARGE_BATCH is not best. > Rather we should always keep at least MEMCG_CHARGE_BATCH for such > scenarios. 
> > > > > > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > > index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36 > > 100644 > > --- a/include/linux/page_counter.h > > +++ b/include/linux/page_counter.h > > @@ -3,12 +3,15 @@ > > #define _LINUX_PAGE_COUNTER_H > > > > #include <linux/atomic.h> > > +#include <linux/cache.h> > > #include <linux/kernel.h> > > #include <asm/page.h> > > > > struct page_counter { > > - atomic_long_t usage; > > - unsigned long min; > > + /* contended cache line. */ > > + atomic_long_t usage ____cacheline_aligned_in_smp; > > + > > + unsigned long min ____cacheline_aligned_in_smp; > > Do we need to align 'min' too? Probably if there is a hierarchy ... propagate_protected_usage() seems to have potential high cost. > > > unsigned long low; > > unsigned long high; > > unsigned long max; > > @@ -27,12 +30,6 @@ struct page_counter { > > unsigned long watermark; > > unsigned long failcnt; > > > > - /* > > - * 'parent' is placed here to be far from 'usage' to reduce > > - * cache false sharing, as 'usage' is written mostly while > > - * parent is frequently read for cgroup's hierarchical > > - * counting nature. > > - */ > > struct page_counter *parent; > > }; > > > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 16:48 ` Shakeel Butt 2022-06-27 17:05 ` Eric Dumazet @ 2022-06-28 1:46 ` Roman Gushchin 1 sibling, 0 replies; 32+ messages in thread From: Roman Gushchin @ 2022-06-28 1:46 UTC (permalink / raw) To: Shakeel Butt Cc: Eric Dumazet, Feng Tang, Linux MM, Andrew Morton, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 09:48:01AM -0700, Shakeel Butt wrote: > On Mon, Jun 27, 2022 at 9:26 AM Eric Dumazet <edumazet@google.com> wrote: > > > [...] > > > > > > > I simply did the following and got much better results. > > > > But I am not sure if updates to ->usage are really needed that often... > > I suspect we need to improve the per-cpu memcg stock usage here. Were > the updates mostly from uncharge path or charge path or that's > irrelevant? > > I think doing full drain (i.e. drain_stock()) within __refill_stock() > when the local cache is larger than MEMCG_CHARGE_BATCH is not best. > Rather we should always keep at least MEMCG_CHARGE_BATCH for such > scenarios. +1, really good point. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-27 16:25 ` Eric Dumazet 2022-06-27 16:48 ` Shakeel Butt @ 2022-06-28 3:49 ` Feng Tang 2022-07-01 15:47 ` Shakeel Butt 1 sibling, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-06-28 3:49 UTC (permalink / raw) To: Eric Dumazet Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 06:25:59PM +0200, Eric Dumazet wrote: > On Mon, Jun 27, 2022 at 4:48 PM Feng Tang <feng.tang@intel.com> wrote: > > > > Yes, I also analyzed the perf-profile data, and made some layout changes > > which could recover the changes from 69% to 40%. > > > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 > > ---------------- --------------------------- --------------------------- > > 15722 -69.5% 4792 -40.8% 9300 netperf.Throughput_Mbps > > > > I simply did the following and got much better results. > > But I am not sure if updates to ->usage are really needed that often... > > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36 > 100644 > --- a/include/linux/page_counter.h > +++ b/include/linux/page_counter.h > @@ -3,12 +3,15 @@ > #define _LINUX_PAGE_COUNTER_H > > #include <linux/atomic.h> > +#include <linux/cache.h> > #include <linux/kernel.h> > #include <asm/page.h> > > struct page_counter { > - atomic_long_t usage; > - unsigned long min; > + /* contended cache line. 
*/ > > + atomic_long_t usage ____cacheline_aligned_in_smp; > > + > > + unsigned long min ____cacheline_aligned_in_smp; > > unsigned long low; > > unsigned long high; > > unsigned long max; > > @@ -27,12 +30,6 @@ struct page_counter { > > unsigned long watermark; > > unsigned long failcnt; > > > > - /* > > - * 'parent' is placed here to be far from 'usage' to reduce > > - * cache false sharing, as 'usage' is written mostly while > > - * parent is frequently read for cgroup's hierarchical > > - * counting nature. > > - */ > > struct page_counter *parent; > > }; I just tested it, it does perform better (the 4th is with your patch), some perf-profile data is also listed. 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d ---------------- --------------------------- --------------------------- --------------------------- 15722 -69.5% 4792 -40.8% 9300 -27.9% 11341 netperf.Throughput_Mbps 0.00 +0.3 0.26 ± 5% +0.5 0.51 +1.3 1.27 ± 2% pp.self.__sk_mem_raise_allocated 0.00 +0.3 0.32 ± 15% +1.7 1.74 ± 2% +0.4 0.40 ± 2% pp.self.propagate_protected_usage 0.00 +0.8 0.82 ± 7% +0.9 0.90 +0.8 0.84 pp.self.__mod_memcg_state 0.00 +1.2 1.24 ± 4% +1.0 1.01 +1.4 1.44 pp.self.try_charge_memcg 0.00 +2.1 2.06 +2.1 2.13 +2.1 2.11 pp.self.page_counter_uncharge 0.00 +2.1 2.14 ± 4% +2.7 2.71 +2.6 2.60 ± 2% pp.self.page_counter_try_charge 1.12 ± 4% +3.1 4.24 +1.1 2.22 +1.4 2.51 pp.self.native_queued_spin_lock_slowpath 0.28 ± 9% +3.8 4.06 ± 4% +0.2 0.48 +0.4 0.68 pp.self.sctp_eat_data 0.00 +8.2 8.23 +0.8 0.83 +1.3 1.26 pp.self.__sk_mem_reduce_allocated And the size of 'mem_cgroup' increased from 4224 bytes to 4608. Another note: the perf hotspots are slightly different between the tcp and sctp test cases. Thanks, Feng ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-06-28 3:49 ` Feng Tang @ 2022-07-01 15:47 ` Shakeel Butt 2022-07-03 10:43 ` Feng Tang 0 siblings, 1 reply; 32+ messages in thread From: Shakeel Butt @ 2022-07-01 15:47 UTC (permalink / raw) To: Feng Tang Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote: > > On Mon, Jun 27, 2022 at 06:25:59PM +0200, Eric Dumazet wrote: > > On Mon, Jun 27, 2022 at 4:48 PM Feng Tang <feng.tang@intel.com> wrote: > > > > > > Yes, I also analyzed the perf-profile data, and made some layout changes > > > which could recover the changes from 69% to 40%. > > > > > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 > > > ---------------- --------------------------- --------------------------- > > > 15722 -69.5% 4792 -40.8% 9300 netperf.Throughput_Mbps > > > > > > > I simply did the following and got much better results. > > > > But I am not sure if updates to ->usage are really needed that often... > > > > > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h > > index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36 > > 100644 > > --- a/include/linux/page_counter.h > > +++ b/include/linux/page_counter.h > > @@ -3,12 +3,15 @@ > > #define _LINUX_PAGE_COUNTER_H > > > > #include <linux/atomic.h> > > +#include <linux/cache.h> > > #include <linux/kernel.h> > > #include <asm/page.h> > > > > struct page_counter { > > - atomic_long_t usage; > > - unsigned long min; > > + /* contended cache line. 
*/ > > + atomic_long_t usage ____cacheline_aligned_in_smp; > > + > > + unsigned long min ____cacheline_aligned_in_smp; > > unsigned long low; > > unsigned long high; > > unsigned long max; > > @@ -27,12 +30,6 @@ struct page_counter { > > unsigned long watermark; > > unsigned long failcnt; > > > > - /* > > - * 'parent' is placed here to be far from 'usage' to reduce > > - * cache false sharing, as 'usage' is written mostly while > > - * parent is frequently read for cgroup's hierarchical > > - * counting nature. > > - */ > > struct page_counter *parent; > > }; > > I just tested it, it does perform better (the 4th is with your patch), > some perf-profile data is also listed. > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d > ---------------- --------------------------- --------------------------- --------------------------- > 15722 -69.5% 4792 -40.8% 9300 -27.9% 11341 netperf.Throughput_Mbps > > 0.00 +0.3 0.26 ± 5% +0.5 0.51 +1.3 1.27 ± 2%pp.self.__sk_mem_raise_allocated > 0.00 +0.3 0.32 ± 15% +1.7 1.74 ± 2% +0.4 0.40 ± 2% pp.self.propagate_protected_usage > 0.00 +0.8 0.82 ± 7% +0.9 0.90 +0.8 0.84 pp.self.__mod_memcg_state > 0.00 +1.2 1.24 ± 4% +1.0 1.01 +1.4 1.44 pp.self.try_charge_memcg > 0.00 +2.1 2.06 +2.1 2.13 +2.1 2.11 pp.self.page_counter_uncharge > 0.00 +2.1 2.14 ± 4% +2.7 2.71 +2.6 2.60 ± 2% pp.self.page_counter_try_charge > 1.12 ± 4% +3.1 4.24 +1.1 2.22 +1.4 2.51 pp.self.native_queued_spin_lock_slowpath > 0.28 ± 9% +3.8 4.06 ± 4% +0.2 0.48 +0.4 0.68 pp.self.sctp_eat_data > 0.00 +8.2 8.23 +0.8 0.83 +1.3 1.26 pp.self.__sk_mem_reduce_allocated > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608. Hi Feng, can you please try two more configurations? Take Eric's patch of adding ____cacheline_aligned_in_smp in page_counter and for first increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128. Basically batch increases combined with Eric's patch. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-07-01 15:47 ` Shakeel Butt @ 2022-07-03 10:43 ` Feng Tang 2022-07-03 22:55 ` Roman Gushchin 0 siblings, 1 reply; 32+ messages in thread From: Feng Tang @ 2022-07-03 10:43 UTC (permalink / raw) To: Shakeel Butt Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu Hi Shakeel, On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote: > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote: > > I just tested it, it does perform better (the 4th is with your patch), > > some perf-profile data is also listed. > > > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d > > ---------------- --------------------------- --------------------------- --------------------------- > > 15722 -69.5% 4792 -40.8% 9300 -27.9% 11341 netperf.Throughput_Mbps > > > > 0.00 +0.3 0.26 ± 5% +0.5 0.51 +1.3 1.27 ± 2%pp.self.__sk_mem_raise_allocated > > 0.00 +0.3 0.32 ± 15% +1.7 1.74 ± 2% +0.4 0.40 ± 2% pp.self.propagate_protected_usage > > 0.00 +0.8 0.82 ± 7% +0.9 0.90 +0.8 0.84 pp.self.__mod_memcg_state > > 0.00 +1.2 1.24 ± 4% +1.0 1.01 +1.4 1.44 pp.self.try_charge_memcg > > 0.00 +2.1 2.06 +2.1 2.13 +2.1 2.11 pp.self.page_counter_uncharge > > 0.00 +2.1 2.14 ± 4% +2.7 2.71 +2.6 2.60 ± 2% pp.self.page_counter_try_charge > > 1.12 ± 4% +3.1 4.24 +1.1 2.22 +1.4 2.51 pp.self.native_queued_spin_lock_slowpath > > 0.28 ± 9% +3.8 4.06 ± 4% +0.2 0.48 +0.4 0.68 pp.self.sctp_eat_data > > 0.00 +8.2 8.23 +0.8 0.83 +1.3 1.26 pp.self.__sk_mem_reduce_allocated > > > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608. 
> > Hi Feng, can you please try two more configurations? Take Eric's patch > of adding ____cacheline_aligned_in_smp in page_counter and for first > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128. > Basically batch increases combined with Eric's patch. With increasing batch to 128, the regression could be reduced to -12.4%. Some more details with perf-profile data below: 7c80b038d23e1f4c 4890b686f4088c90432149bd6de Eric's patch Eric's patch + batch-64 Eric's patch + batch-128 ---------------- --------------------------- --------------------------- --------------------------- --------------------------- 15722 -69.5% 4792 -27.9% 11341 -14.0% 13521 -12.4% 13772 netperf.Throughput_Mbps 0.05 +0.2 0.27 ± 18% +0.0 0.08 ± 6% -0.1 0.00 -0.0 0.03 ±100% pp.self.timekeeping_max_deferment 0.00 +0.3 0.26 ± 5% +1.3 1.27 ± 2% +1.8 1.82 ± 10% +2.0 1.96 ± 9% pp.self.__sk_mem_raise_allocated 0.00 +0.3 0.32 ± 15% +0.4 0.40 ± 2% +0.1 0.10 ± 5% +0.0 0.00 pp.self.propagate_protected_usage 0.00 +0.8 0.82 ± 7% +0.8 0.84 +0.5 0.48 +0.4 0.36 ± 2% pp.self.__mod_memcg_state 0.00 +1.2 1.24 ± 4% +1.4 1.44 +0.4 0.40 ± 3% +0.2 0.24 ± 6% pp.self.try_charge_memcg 0.00 +2.1 2.06 +2.1 2.11 +0.5 0.50 +0.2 0.18 ± 8% pp.self.page_counter_uncharge 0.00 +2.1 2.14 ± 4% +2.6 2.60 ± 2% +0.6 0.58 +0.2 0.20 pp.self.page_counter_try_charge 1.12 ± 4% +3.1 4.24 +1.4 2.51 +1.0 2.10 ± 2% +1.0 2.10 ± 9% pp.self.native_queued_spin_lock_slowpath 0.28 ± 9% +3.8 4.06 ± 4% +0.4 0.68 +0.6 0.90 ± 9% +0.7 1.00 ± 11% pp.self.sctp_eat_data 0.00 +8.2 8.23 +1.3 1.26 +1.7 1.72 ± 6% +2.0 1.95 ± 10% pp.self.__sk_mem_reduce_allocated Thanks, Feng ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression 2022-07-03 10:43 ` Feng Tang @ 2022-07-03 22:55 ` Roman Gushchin 2022-07-05 5:03 ` Feng Tang 0 siblings, 1 reply; 32+ messages in thread From: Roman Gushchin @ 2022-07-03 22:55 UTC (permalink / raw) To: Feng Tang Cc: Shakeel Butt, Eric Dumazet, Linux MM, Andrew Morton, Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML, network dev, linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp, kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu On Sun, Jul 03, 2022 at 06:43:53PM +0800, Feng Tang wrote: > Hi Shakeel, > > On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote: > > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote: > > > I just tested it, it does perform better (the 4th is with your patch), > > > some perf-profile data is also listed. > > > > > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d > > > ---------------- --------------------------- --------------------------- --------------------------- > > > 15722 -69.5% 4792 -40.8% 9300 -27.9% 11341 netperf.Throughput_Mbps > > > > > > 0.00 +0.3 0.26 ± 5% +0.5 0.51 +1.3 1.27 ± 2%pp.self.__sk_mem_raise_allocated > > > 0.00 +0.3 0.32 ± 15% +1.7 1.74 ± 2% +0.4 0.40 ± 2% pp.self.propagate_protected_usage > > > 0.00 +0.8 0.82 ± 7% +0.9 0.90 +0.8 0.84 pp.self.__mod_memcg_state > > > 0.00 +1.2 1.24 ± 4% +1.0 1.01 +1.4 1.44 pp.self.try_charge_memcg > > > 0.00 +2.1 2.06 +2.1 2.13 +2.1 2.11 pp.self.page_counter_uncharge > > > 0.00 +2.1 2.14 ± 4% +2.7 2.71 +2.6 2.60 ± 2% pp.self.page_counter_try_charge > > > 1.12 ± 4% +3.1 4.24 +1.1 2.22 +1.4 2.51 pp.self.native_queued_spin_lock_slowpath > > > 0.28 ± 9% +3.8 4.06 ± 4% +0.2 0.48 +0.4 0.68 pp.self.sctp_eat_data > > > 0.00 +8.2 8.23 +0.8 0.83 +1.3 1.26 pp.self.__sk_mem_reduce_allocated > > > > > > And the size of 
'mem_cgroup' is increased from 4224 Bytes to 4608. > > > > Hi Feng, can you please try two more configurations? Take Eric's patch > > of adding ____cacheline_aligned_in_smp in page_counter and for first > > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128. > > Basically batch increases combined with Eric's patch. > > With increasing batch to 128, the regression could be reduced to -12.4%. If we're going to bump it, I wonder if we should scale it dynamically depending on the size of the memory cgroup? Thanks! ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-07-03 22:55   ` Roman Gushchin
@ 2022-07-05  5:03   ` Feng Tang
  2022-08-16  5:52   ` Oliver Sang
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-07-05 5:03 UTC (permalink / raw)
To: Roman Gushchin
Cc: Shakeel Butt, Eric Dumazet, Linux MM, Andrew Morton, Michal Hocko,
    Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
    Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh,
    LKML, network dev, linux-s390, MPTCP Upstream,
    linux-sctp@vger.kernel.org, lkp, kbuild test robot, Huang Ying,
    Xing Zhengjun, Yin Fengwei, Ying Xu

On Sun, Jul 03, 2022 at 03:55:31PM -0700, Roman Gushchin wrote:
> On Sun, Jul 03, 2022 at 06:43:53PM +0800, Feng Tang wrote:
> > Hi Shakeel,
> >
> > On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
> > > > I just tested it, it does perform better (the 4th is with your patch),
> > > > some perf-profile data is also listed.
> > > >
> > > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> > > > ---------------- --------------------------- --------------------------- ---------------------------
> > > >     15722          -69.5%       4792          -40.8%       9300          -27.9%      11341        netperf.Throughput_Mbps
> > > >
> > > >      0.00           +0.3        0.26 ± 5%     +0.5        0.51           +1.3        1.27 ± 2%    pp.self.__sk_mem_raise_allocated
> > > >      0.00           +0.3        0.32 ± 15%    +1.7        1.74 ± 2%      +0.4        0.40 ± 2%    pp.self.propagate_protected_usage
> > > >      0.00           +0.8        0.82 ± 7%     +0.9        0.90           +0.8        0.84         pp.self.__mod_memcg_state
> > > >      0.00           +1.2        1.24 ± 4%     +1.0        1.01           +1.4        1.44         pp.self.try_charge_memcg
> > > >      0.00           +2.1        2.06          +2.1        2.13           +2.1        2.11         pp.self.page_counter_uncharge
> > > >      0.00           +2.1        2.14 ± 4%     +2.7        2.71           +2.6        2.60 ± 2%    pp.self.page_counter_try_charge
> > > >      1.12 ± 4%      +3.1        4.24          +1.1        2.22           +1.4        2.51         pp.self.native_queued_spin_lock_slowpath
> > > >      0.28 ± 9%      +3.8        4.06 ± 4%     +0.2        0.48           +0.4        0.68         pp.self.sctp_eat_data
> > > >      0.00           +8.2        8.23          +0.8        0.83           +1.3        1.26         pp.self.__sk_mem_reduce_allocated
> > > >
> > > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.
> > >
> > > Hi Feng, can you please try two more configurations? Take Eric's patch
> > > of adding ____cacheline_aligned_in_smp in page_counter and for first
> > > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128.
> > > Basically batch increases combined with Eric's patch.
> >
> > With increasing batch to 128, the regression could be reduced to -12.4%.
>
> If we're going to bump it, I wonder if we should scale it dynamically depending
> on the size of the memory cgroup?

I think it makes sense, or we could also make it a configurable parameter?
From the 0Day test reports, this charging/counting plays a critical role in
performance (it's easy to see up to a 60% performance effect). If a user only
wants memcg for isolating things, or doesn't care about charging/stats, these
seem to be extra taxes.

For bumping to 64 or 128, a universal improvement is expected, with the only
concern being accuracy.

Thanks,
Feng

> Thanks!
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-07-05  5:03   ` Feng Tang
@ 2022-08-16  5:52   ` Oliver Sang
  2022-08-16 15:55   ` Shakeel Butt
  0 siblings, 1 reply; 32+ messages in thread
From: Oliver Sang @ 2022-08-16 5:52 UTC (permalink / raw)
To: Feng Tang
Cc: Roman Gushchin, Shakeel Butt, Eric Dumazet, Linux MM, Andrew Morton,
    Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
    Marcelo Ricardo Leitner, Soheil Hassas Yeganeh, LKML, network dev,
    linux-s390, MPTCP Upstream, linux-sctp@vger.kernel.org, lkp,
    kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu

Hi all,

now we noticed this commit has already been merged into mainline, and in
our tests there is still a similar regression. [1]

not sure if there is a plan to merge some of the solutions that have been
discussed in this thread? we'll be very glad to test patches if there is
a request.

Thanks a lot!

[1]
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/ip/runtime/nr_threads/cluster/send_size/test/cpufreq_governor/ucode:
  lkp-icl-2sp4/netperf/debian-11.1-x86_64-20220510.cgz/x86_64-rhel-8.3/gcc-11/ipv4/300s/50%/cs-localhost/10K/SCTP_STREAM_MANY/performance/0xd000363

7c80b038d23e1f4c 4890b686f4088c90432149bd6de
---------------- ---------------------------
       %stddev      %change       %stddev
           \           |              \
      9078           -55.9%       4006        netperf.Throughput_Mbps
    581006           -55.9%     256385        netperf.Throughput_total_Mbps
     36715           -54.6%      16674 ± 4%   netperf.time.involuntary_context_switches
      1885           -50.2%     938.33 ± 3%   netperf.time.percent_of_cpu_this_job_got
      5533           -49.9%       2771 ± 2%   netperf.time.system_time
    152.13           -59.5%      61.61 ± 2%   netperf.time.user_time
    418171 ± 5%      +89.4%     791954 ± 17%  netperf.time.voluntary_context_switches
 2.128e+09           -55.9%  9.389e+08        netperf.workload
     30217           +17.8%      35608        uptime.idle
 2.689e+10           +20.3%  3.234e+10        cpuidle..time
 6.366e+08           -48.1%  3.305e+08        cpuidle..usage
     70.26           +13.5       83.78        mpstat.cpu.all.idle%
      4.46            -1.5        2.92 ± 3%   mpstat.cpu.all.soft%
     23.71           -11.6       12.16 ± 3%   mpstat.cpu.all.sys%
      0.89            -0.5        0.38        mpstat.cpu.all.usr%
 1.392e+09           -57.5%   5.91e+08 ± 12%  numa-numastat.node0.local_node
 1.389e+09           -57.5%  5.906e+08 ± 12%  numa-numastat.node0.numa_hit
 1.369e+09           -54.5%  6.226e+08 ± 12%  numa-numastat.node1.local_node
 1.366e+09           -54.4%  6.222e+08 ± 12%  numa-numastat.node1.numa_hit
     36715           -54.6%      16674 ± 4%   time.involuntary_context_switches
      1885           -50.2%     938.33 ± 3%   time.percent_of_cpu_this_job_got
      5533           -49.9%       2771 ± 2%   time.system_time
    152.13           -59.5%      61.61 ± 2%   time.user_time
    418171 ± 5%      +89.4%     791954 ± 17%  time.voluntary_context_switches

On Tue, Jul 05, 2022 at 01:03:26PM +0800, Feng Tang wrote:
> On Sun, Jul 03, 2022 at 03:55:31PM -0700, Roman Gushchin wrote:
> > On Sun, Jul 03, 2022 at 06:43:53PM +0800, Feng Tang wrote:
> > > Hi Shakeel,
> > >
> > > On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
> > > > > I just tested it, it does perform better (the 4th is with your patch),
> > > > > some perf-profile data is also listed.
> > > > >
> > > > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> > > > > ---------------- --------------------------- --------------------------- ---------------------------
> > > > >     15722          -69.5%       4792          -40.8%       9300          -27.9%      11341        netperf.Throughput_Mbps
> > > > >
> > > > >      0.00           +0.3        0.26 ± 5%     +0.5        0.51           +1.3        1.27 ± 2%    pp.self.__sk_mem_raise_allocated
> > > > >      0.00           +0.3        0.32 ± 15%    +1.7        1.74 ± 2%      +0.4        0.40 ± 2%    pp.self.propagate_protected_usage
> > > > >      0.00           +0.8        0.82 ± 7%     +0.9        0.90           +0.8        0.84         pp.self.__mod_memcg_state
> > > > >      0.00           +1.2        1.24 ± 4%     +1.0        1.01           +1.4        1.44         pp.self.try_charge_memcg
> > > > >      0.00           +2.1        2.06          +2.1        2.13           +2.1        2.11         pp.self.page_counter_uncharge
> > > > >      0.00           +2.1        2.14 ± 4%     +2.7        2.71           +2.6        2.60 ± 2%    pp.self.page_counter_try_charge
> > > > >      1.12 ± 4%      +3.1        4.24          +1.1        2.22           +1.4        2.51         pp.self.native_queued_spin_lock_slowpath
> > > > >      0.28 ± 9%      +3.8        4.06 ± 4%     +0.2        0.48           +0.4        0.68         pp.self.sctp_eat_data
> > > > >      0.00           +8.2        8.23          +0.8        0.83           +1.3        1.26         pp.self.__sk_mem_reduce_allocated
> > > > >
> > > > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.
> > > >
> > > > Hi Feng, can you please try two more configurations? Take Eric's patch
> > > > of adding ____cacheline_aligned_in_smp in page_counter and for first
> > > > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128.
> > > > Basically batch increases combined with Eric's patch.
> > >
> > > With increasing batch to 128, the regression could be reduced to -12.4%.
> >
> > If we're going to bump it, I wonder if we should scale it dynamically depending
> > on the size of the memory cgroup?
>
> I think it makes sense, or we could also make it a configurable parameter?
> From the 0Day test reports, this charging/counting plays a critical role in
> performance (it's easy to see up to a 60% performance effect). If a user only
> wants memcg for isolating things, or doesn't care about charging/stats, these
> seem to be extra taxes.
>
> For bumping to 64 or 128, a universal improvement is expected, with the only
> concern being accuracy.
>
> Thanks,
> Feng
>
> > Thanks!
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-08-16  5:52   ` Oliver Sang
@ 2022-08-16 15:55   ` Shakeel Butt
  0 siblings, 0 replies; 32+ messages in thread
From: Shakeel Butt @ 2022-08-16 15:55 UTC (permalink / raw)
To: Oliver Sang
Cc: Feng Tang, Roman Gushchin, Eric Dumazet, Linux MM, Andrew Morton,
    Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
    Marcelo Ricardo Leitner, Soheil Hassas Yeganeh, LKML, network dev,
    linux-s390, MPTCP Upstream, linux-sctp@vger.kernel.org, lkp,
    kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu

On Mon, Aug 15, 2022 at 10:53 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> Hi all,
>
> now we noticed this commit has already been merged into mainline, and in
> our tests there is still a similar regression. [1]
>
> not sure if there is a plan to merge some of the solutions that have been
> discussed in this thread? we'll be very glad to test patches if there is
> a request.
>
> Thanks a lot!

Hi Oliver, sorry for the delay. I will send out the patches in a day or two.

thanks,
Shakeel
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 12:34   ` Feng Tang
  2022-06-27 14:07   ` Eric Dumazet
@ 2022-06-27 14:52   ` Shakeel Butt
  2022-06-27 14:56   ` Eric Dumazet
  2022-06-27 15:12   ` Feng Tang
  1 sibling, 2 replies; 32+ messages in thread
From: Shakeel Butt @ 2022-06-27 14:52 UTC (permalink / raw)
To: Feng Tang
Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
    Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
    Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh,
    LKML, network dev, linux-s390, MPTCP Upstream,
    linux-sctp@vger.kernel.org, lkp, kbuild test robot, Huang Ying,
    Xing Zhengjun, Yin Fengwei, Ying Xu

On Mon, Jun 27, 2022 at 5:34 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote:
> [snip]
> > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > > > sure that the netperf test has already run?
> > > > >
> > > > > memory.kmem.tcp.max_usage_in_bytes:0
> > > >
> > > > Sorry, I made a mistake: in the original report from Oliver, it
> > > > was 'cgroup v2' with a 'debian-11.1' rootfs.
> > > >
> > > > When you asked about cgroup info, I tried the job on another tbox, and
> > > > the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > > > parameters and started a new job which somehow ran with a 'debian-10.4'
> > > > rootfs and actually ran with cgroup v1.
> > > >
> > > > And as you mentioned, cgroup version does make a big difference: with
> > > > v1, the regression is reduced to 1% ~ 5% on different generations of
> > > > test platforms. Eric mentioned they also got a regression report, but
> > > > a much smaller one; maybe it's due to the cgroup version?
> >
> > This was using the current net-next tree.
> >
> > Used recipe was something like:
> >
> > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
> > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> > Create a cgroup by mkdir $MOUNT_POINT/job.
> > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.
> >
> > <Launch tests>
> >
> > The regression was smaller than 1%, so considered noise compared to
> > the benefits of the bug fix.
>
> Yes, 1% is just around the noise level for a microbenchmark.
>
> I went to check the original test data of Oliver's report: the tests were
> run 6 rounds and the performance data is pretty stable (0Day's report
> will show any std deviation bigger than 2%).
>
> The test platform is a 4-socket 72C/144T machine, and I ran the
> same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> and one Icelake 2-socket platform, and saw 75% and 53% regressions on
> them.
>
> In the first email, there is a file named 'reproduce', it shows the
> basic test process:
>
> "
> use 'performance' cpufreq governor for all CPUs
>
> netserver -4 -D
> modprobe sctp
> netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> (repeat 36 times in total)
> ...
> "
>
> Which starts 36 (25% of nr_cpus) netperf clients. And the client number
> also matters: I tried to increase the client number from 36 to 72 (50%),
> and the regression changed from 69.4% to 73.7%.

Am I understanding correctly that this 69.4% (or 73.7%) regression is
with cgroup v2?

Eric did the experiments on v2 but on real hardware where the
performance impact was negligible.

BTW do you see similar regression for tcp as well or just sctp?
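Eric's cgroup-v2 recipe and the 0Day reproduce steps combine into one script. This is a sketch under stated assumptions: it must run as root, netperf/netserver must be installed, the kernel must support cgroup v2, and the mount point is a guess; the ucode and cpufreq-governor setup from the report are outside its scope.

```shell
#!/bin/sh
# Sketch: put this shell into a fresh memory-controlled cgroup-v2 group,
# then launch the 0Day SCTP_STREAM_MANY workload inside it.
MOUNT_POINT=/sys/fs/cgroup                     # assumed mount point
mountpoint -q "$MOUNT_POINT" || mount -t cgroup2 none "$MOUNT_POINT"
echo +memory > "$MOUNT_POINT/cgroup.subtree_control"
mkdir -p "$MOUNT_POINT/job"
echo $$ > "$MOUNT_POINT/job/cgroup.procs"      # move this shell into the cgroup

modprobe sctp
netserver -4 -D &
sleep 1
# 36 clients = 25% of the 144 CPUs on the reported test box
for i in $(seq 36); do
    netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
done
wait
```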
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 14:52   ` Shakeel Butt
@ 2022-06-27 14:56   ` Eric Dumazet
  2022-06-27 15:12   ` Feng Tang
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2022-06-27 14:56 UTC (permalink / raw)
To: Shakeel Butt
Cc: Feng Tang, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
    Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
    Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh,
    LKML, network dev, linux-s390, MPTCP Upstream,
    linux-sctp@vger.kernel.org, lkp, kbuild test robot, Huang Ying,
    Xing Zhengjun, Yin Fengwei, Ying Xu

On Mon, Jun 27, 2022 at 4:53 PM Shakeel Butt <shakeelb@google.com> wrote:
> Am I understanding correctly that this 69.4% (or 73.7%) regression is
> with cgroup v2?
>
> Eric did the experiments on v2 but on real hardware where the
> performance impact was negligible.
>
> BTW do you see similar regression for tcp as well or just sctp?

TCP_RR with big packets can show a regression as well.

I gave this perf profile:

    28.69%  [kernel]  [k] copy_user_enhanced_fast_string
    16.13%  [kernel]  [k] intel_idle_irq
     6.46%  [kernel]  [k] page_counter_try_charge
     6.20%  [kernel]  [k] __sk_mem_reduce_allocated
     5.68%  [kernel]  [k] try_charge_memcg
     5.16%  [kernel]  [k] page_counter_cancel

And this points to false sharing on (struct page_counter *)->usage.

I guess memcg had a free lunch because of the per-socket cache that we
need to remove.
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 14:52   ` Shakeel Butt
  2022-06-27 14:56   ` Eric Dumazet
@ 2022-06-27 15:12   ` Feng Tang
  2022-06-27 16:25   ` Shakeel Butt
  1 sibling, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-27 15:12 UTC (permalink / raw)
To: Shakeel Butt
Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
    Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
    Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh,
    LKML, network dev, linux-s390, MPTCP Upstream,
    linux-sctp@vger.kernel.org, lkp, kbuild test robot, Huang Ying,
    Xing Zhengjun, Yin Fengwei, Ying Xu

On Mon, Jun 27, 2022 at 07:52:55AM -0700, Shakeel Butt wrote:
> On Mon, Jun 27, 2022 at 5:34 AM Feng Tang <feng.tang@intel.com> wrote:
> > Yes, 1% is just around the noise level for a microbenchmark.
> >
> > I went to check the original test data of Oliver's report: the tests were
> > run 6 rounds and the performance data is pretty stable (0Day's report
> > will show any std deviation bigger than 2%).
> >
> > The test platform is a 4-socket 72C/144T machine, and I ran the
> > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> > and one Icelake 2-socket platform, and saw 75% and 53% regressions on
> > them.
> >
> > In the first email, there is a file named 'reproduce', it shows the
> > basic test process:
> >
> > "
> > use 'performance' cpufreq governor for all CPUs
> >
> > netserver -4 -D
> > modprobe sctp
> > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > (repeat 36 times in total)
> > ...
> > "
> >
> > Which starts 36 (25% of nr_cpus) netperf clients. And the client number
> > also matters: I tried to increase the client number from 36 to 72 (50%),
> > and the regression changed from 69.4% to 73.7%.
>
> Am I understanding correctly that this 69.4% (or 73.7%) regression is
> with cgroup v2?

Yes.

> Eric did the experiments on v2 but on real hardware where the
> performance impact was negligible.
>
> BTW do you see similar regression for tcp as well or just sctp?

Yes, I ran the TCP_SENDFILE case with 'send_size' == 10K, and it hits a
70%+ regression.

Thanks,
Feng
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 15:12   ` Feng Tang
@ 2022-06-27 16:25   ` Shakeel Butt
  0 siblings, 0 replies; 32+ messages in thread
From: Shakeel Butt @ 2022-06-27 16:25 UTC (permalink / raw)
To: Feng Tang
Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
    Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
    Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh,
    LKML, network dev, linux-s390, MPTCP Upstream,
    linux-sctp@vger.kernel.org, lkp, kbuild test robot, Huang Ying,
    Xing Zhengjun, Yin Fengwei, Ying Xu

On Mon, Jun 27, 2022 at 8:25 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 07:52:55AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 27, 2022 at 5:34 AM Feng Tang <feng.tang@intel.com> wrote:
> > > Yes, 1% is just around the noise level for a microbenchmark.
> > >
> > > I went to check the original test data of Oliver's report: the tests were
> > > run 6 rounds and the performance data is pretty stable (0Day's report
> > > will show any std deviation bigger than 2%).
> > >
> > > The test platform is a 4-socket 72C/144T machine, and I ran the
> > > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> > > and one Icelake 2-socket platform, and saw 75% and 53% regressions on
> > > them.
> > >
> > > In the first email, there is a file named 'reproduce', it shows the
> > > basic test process:
> > >
> > > "
> > > use 'performance' cpufreq governor for all CPUs
> > >
> > > netserver -4 -D
> > > modprobe sctp
> > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > > netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > > (repeat 36 times in total)
> > > ...
> > > "
> > >
> > > Which starts 36 (25% of nr_cpus) netperf clients. And the client number
> > > also matters: I tried to increase the client number from 36 to 72 (50%),
> > > and the regression changed from 69.4% to 73.7%.
> >
> > Am I understanding correctly that this 69.4% (or 73.7%) regression is
> > with cgroup v2?
>
> Yes.
>
> > Eric did the experiments on v2 but on real hardware where the
> > performance impact was negligible.
> >
> > BTW do you see similar regression for tcp as well or just sctp?
>
> Yes, I ran the TCP_SENDFILE case with 'send_size' == 10K, and it hits a
> 70%+ regression.

Thanks Feng. I think we should start with squeezing whatever we can
from layout changes and then try other approaches like increasing
batch size or something else. I can take a stab at this next week.

thanks,
Shakeel
end of thread, other threads:[~2022-08-16 15:59 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220619150456.GB34471@xsang-OptiPlex-9020>
     [not found] ` <20220622172857.37db0d29@kernel.org>
     [not found]   ` <CADvbK_csvmkKe46hT9792=+Qcjor2EvkkAnr--CJK3NGX-N9BQ@mail.gmail.com>
2022-06-23 22:50 ` [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression Xin Long
2022-06-24  1:57 ` Jakub Kicinski
2022-06-24  4:13 ` Eric Dumazet
2022-06-24  4:22 ` Eric Dumazet
2022-06-24  5:13 ` Feng Tang
2022-06-24  5:45 ` Eric Dumazet
2022-06-24  6:00 ` Feng Tang
2022-06-24  6:07 ` Eric Dumazet
2022-06-24  6:34 ` Shakeel Butt
2022-06-24  7:06 ` Feng Tang
2022-06-24 14:43 ` Shakeel Butt
2022-06-25  2:36 ` Feng Tang
2022-06-27  2:38 ` Feng Tang
2022-06-27  8:46 ` Eric Dumazet
2022-06-27 12:34 ` Feng Tang
2022-06-27 14:07 ` Eric Dumazet
2022-06-27 14:48 ` Feng Tang
2022-06-27 16:25 ` Eric Dumazet
2022-06-27 16:48 ` Shakeel Butt
2022-06-27 17:05 ` Eric Dumazet
2022-06-28  1:46 ` Roman Gushchin
2022-06-28  3:49 ` Feng Tang
2022-07-01 15:47 ` Shakeel Butt
2022-07-03 10:43 ` Feng Tang
2022-07-03 22:55 ` Roman Gushchin
2022-07-05  5:03 ` Feng Tang
2022-08-16  5:52 ` Oliver Sang
2022-08-16 15:55 ` Shakeel Butt
2022-06-27 14:52 ` Shakeel Butt
2022-06-27 14:56 ` Eric Dumazet
2022-06-27 15:12 ` Feng Tang
2022-06-27 16:25 ` Shakeel Butt