linux-sctp.vger.kernel.org archive mirror
* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
       [not found]   ` <CADvbK_csvmkKe46hT9792=+Qcjor2EvkkAnr--CJK3NGX-N9BQ@mail.gmail.com>
@ 2022-06-23 22:50     ` Xin Long
  2022-06-24  1:57       ` Jakub Kicinski
  0 siblings, 1 reply; 32+ messages in thread
From: Xin Long @ 2022-06-23 22:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Marcelo Ricardo Leitner, kernel test robot, Eric Dumazet,
	Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390, mptcp,
	linux-sctp @ vger . kernel . org, lkp, kbuild test robot,
	Huang Ying, feng.tang, zhengjun.xing, fengwei.yin, Ying Xu

On Wed, Jun 22, 2022 at 11:08 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> Yes, I'm working on it. I couldn't see the regression in my env with
> the 'reproduce' script attached.
> I will try with lkp tomorrow.
>
> Thanks.
>
> On Wed, Jun 22, 2022 at 8:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > Could someone working on SCTP double check this is a real regression?
> > Feels like the regression reports are flowing at such rate its hard
> > to keep up.
> >
> > >
> > > commit:
> > >   7c80b038d2 ("net: fix sk_wmem_schedule() and sk_rmem_schedule() errors")
> > >   4890b686f4 ("net: keep sk->sk_forward_alloc as small as possible")
> > >
> > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de
> > > ---------------- ---------------------------
> > >          %stddev     %change         %stddev
> > >              \          |                \
> > >      15855           -69.4%       4854        netperf.Throughput_Mbps
> > >     570788           -69.4%     174773        netperf.Throughput_total_Mbps
...
> > >       0.00            +5.1        5.10 ±  5%  perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put
> > >       0.17 ±141%      +5.3        5.42 ±  6%  perf-profile.calltrace.cycles-pp.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter
> > >       0.00            +5.3        5.35 ±  6%  perf-profile.calltrace.cycles-pp.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack
> > >       0.00            +5.5        5.51 ±  6%  perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
> > >       0.00            +5.7        5.65 ±  6%  perf-profile.calltrace.cycles-pp.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.____sys_recvmsg
...
> > >       0.00            +4.0        4.04 ±  6%  perf-profile.children.cycles-pp.mem_cgroup_charge_skmem
> > >       2.92 ±  6%      +4.2        7.16 ±  6%  perf-profile.children.cycles-pp.sctp_outq_sack
> > >       0.00            +4.3        4.29 ±  6%  perf-profile.children.cycles-pp.__sk_mem_raise_allocated
> > >       0.00            +4.3        4.32 ±  6%  perf-profile.children.cycles-pp.__sk_mem_schedule
> > >       1.99 ±  6%      +4.4        6.40 ±  6%  perf-profile.children.cycles-pp.consume_skb
> > >       1.78 ±  6%      +4.6        6.42 ±  6%  perf-profile.children.cycles-pp.kfree_skb_reason
> > >       0.37 ±  8%      +5.0        5.40 ±  6%  perf-profile.children.cycles-pp.sctp_wfree
> > >       0.87 ±  9%     +10.3       11.20 ±  6%  perf-profile.children.cycles-pp.skb_release_head_state
> > >       0.00           +10.7       10.66 ±  6%  perf-profile.children.cycles-pp.__sk_mem_reduce_allocated
...
> > >       0.00            +1.2        1.19 ±  7%  perf-profile.self.cycles-pp.try_charge_memcg
> > >       0.00            +2.0        1.96 ±  6%  perf-profile.self.cycles-pp.page_counter_uncharge
> > >       0.00            +2.1        2.07 ±  5%  perf-profile.self.cycles-pp.page_counter_try_charge
> > >       1.09 ±  8%      +2.8        3.92 ±  6%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
> > >       0.29 ±  6%      +3.5        3.81 ±  6%  perf-profile.self.cycles-pp.sctp_eat_data
> > >       0.00            +7.8        7.76 ±  6%  perf-profile.self.cycles-pp.__sk_mem_reduce_allocated

From the perf data, we can see __sk_mem_reduce_allocated() is the
function whose CPU usage grew the most compared to before, and the
mem_cgroup APIs are also called from this function. That means the mem
cgroup must be enabled in the test env, which may explain why I
couldn't reproduce it.

Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
possible") makes sk_mem_reclaim() reclaim memory once reclaimable >=
PAGE_SIZE, so __sk_mem_reduce_allocated() is called *more frequently*
than before (when the check was reclaimable >= SK_RECLAIM_THRESHOLD).
That might be cheap when mem_cgroup_sockets_enabled is false, but I'm
not sure it's still cheap when mem_cgroup_sockets_enabled is true.
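
To illustrate, after commit 4890b686f4 the helper looks roughly like
this (a condensed paraphrase of include/net/sock.h, not a verbatim
copy); before the commit, the same helper only fired once reclaimable
reached SK_RECLAIM_THRESHOLD (1 << 21, IIRC):

static inline void sk_mem_reclaim(struct sock *sk)
{
	int reclaimable;

	if (!sk_has_account(sk))
		return;

	/* memory reserved via SO_RESERVE_MEM is not reclaimed */
	reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);

	/* reclaim as soon as one page is reclaimable, so
	 * __sk_mem_reduce_allocated() (and the memcg uncharge path)
	 * now runs far more often than with the old threshold */
	if (reclaimable >= (int)PAGE_SIZE)
		__sk_mem_reclaim(sk, reclaimable);
}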

I think SCTP netperf could trigger this, as the CPU is the bottleneck
for SCTP netperf testing, which is more sensitive to the extra
function calls than TCP.

Can we re-run this testing without mem cgroup enabled?

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-23 22:50     ` [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression Xin Long
@ 2022-06-24  1:57       ` Jakub Kicinski
  2022-06-24  4:13         ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Jakub Kicinski @ 2022-06-24  1:57 UTC (permalink / raw)
  To: Xin Long
  Cc: Marcelo Ricardo Leitner, kernel test robot, Eric Dumazet,
	Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390, mptcp,
	linux-sctp @ vger . kernel . org, lkp, kbuild test robot,
	Huang Ying, feng.tang, zhengjun.xing, fengwei.yin, Ying Xu

On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> From the perf data, we can see __sk_mem_reduce_allocated() is the one
> using CPU the most more than before, and mem_cgroup APIs are also
> called in this function. It means the mem cgroup must be enabled in
> the test env, which may explain why I couldn't reproduce it.
> 
> The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> reclaim the memory, which is *more frequent* to call
> __sk_mem_reduce_allocated() than before (checking reclaimable >=
> SK_RECLAIM_THRESHOLD). It might be cheap when
> mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> cheap when mem_cgroup_sockets_enabled is true.
> 
> I think SCTP netperf could trigger this, as the CPU is the bottleneck
> for SCTP netperf testing, which is more sensitive to the extra
> function calls than TCP.
> 
> Can we re-run this testing without mem cgroup enabled?

FWIW I defer to Eric, thanks a lot for double checking the report 
and digging in!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  1:57       ` Jakub Kicinski
@ 2022-06-24  4:13         ` Eric Dumazet
  2022-06-24  4:22           ` Eric Dumazet
                             ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Eric Dumazet @ 2022-06-24  4:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Tang, Feng, zhengjun.xing,
	fengwei.yin, Ying Xu

On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > using CPU the most more than before, and mem_cgroup APIs are also
> > called in this function. It means the mem cgroup must be enabled in
> > the test env, which may explain why I couldn't reproduce it.
> >
> > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > reclaim the memory, which is *more frequent* to call
> > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > SK_RECLAIM_THRESHOLD). It might be cheap when
> > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > cheap when mem_cgroup_sockets_enabled is true.
> >
> > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > for SCTP netperf testing, which is more sensitive to the extra
> > function calls than TCP.
> >
> > Can we re-run this testing without mem cgroup enabled?
>
> FWIW I defer to Eric, thanks a lot for double checking the report
> and digging in!

I did tests with TCP + memcg and noticed a very small additional cost
in memcg functions,
because of suboptimal layout:

Extract of an internal Google bug, update from June 9th:

--------------------------------
I have noticed a minor false sharing to fetch (struct
mem_cgroup)->css.parent, at offset 0xc0,
because it shares the cache line containing struct mem_cgroup.memory,
at offset 0xd0

Ideally, memcg->socket_pressure and memcg->parent should sit in a read
mostly cache line.
-----------------------

But nothing that could explain a "-69.4% regression"

memcg has a very similar strategy of per-cpu reserves, with
MEMCG_CHARGE_BATCH being 32 pages per cpu.
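
For context, that per-cpu reserve works roughly as below (a condensed
paraphrase of consume_stock() in mm/memcontrol.c, with locking and
stock-flushing details omitted):

/* try_charge() first consumes a per-cpu stock of up to
 * MEMCG_CHARGE_BATCH (32) pre-charged pages and only touches the
 * shared page_counter when the stock is empty or cached for another
 * memcg, so most charges stay CPU-local.
 */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return false;

	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		return true;
	}
	return false;
}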

It is not clear why SCTP with 10K writes would overflow this reserve constantly.

Presumably memcg experts will have to rework structure alignments to
make sure they can cope better with more charge/uncharge operations,
because we are not going back to gigantic per-socket reserves; that
simply does not scale.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  4:13         ` Eric Dumazet
@ 2022-06-24  4:22           ` Eric Dumazet
  2022-06-24  5:13           ` Feng Tang
  2022-06-24  6:34           ` Shakeel Butt
  2 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2022-06-24  4:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Tang, Feng, zhengjun.xing,
	fengwei.yin, Ying Xu

On Fri, Jun 24, 2022 at 6:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > using CPU the most more than before, and mem_cgroup APIs are also
> > > called in this function. It means the mem cgroup must be enabled in
> > > the test env, which may explain why I couldn't reproduce it.
> > >
> > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > reclaim the memory, which is *more frequent* to call
> > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > cheap when mem_cgroup_sockets_enabled is true.
> > >
> > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > for SCTP netperf testing, which is more sensitive to the extra
> > > function calls than TCP.
> > >
> > > Can we re-run this testing without mem cgroup enabled?
> >
> > FWIW I defer to Eric, thanks a lot for double checking the report
> > and digging in!
>
> I did tests with TCP + memcg and noticed a very small additional cost
> in memcg functions,
> because of suboptimal layout:
>
> Extract of an internal Google bug, update from June 9th:
>
> --------------------------------
> I have noticed a minor false sharing to fetch (struct
> mem_cgroup)->css.parent, at offset 0xc0,
> because it shares the cache line containing struct mem_cgroup.memory,
> at offset 0xd0
>
> Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> mostly cache line.
> -----------------------
>
> But nothing that could explain a "-69.4% regression"

I guess the test now hits memcg limits more often, forcing expensive reclaim,
and the memcg limits need some adjustments.

Overall, tests enabling memcg probably need some fine tuning; I will
defer to the Intel folks.


>
> memcg has a very similar strategy of per-cpu reserves, with
> MEMCG_CHARGE_BATCH being 32 pages per cpu.
>
> It is not clear why SCTP with 10K writes would overflow this reserve constantly.
>
> Presumably memcg experts will have to rework structure alignments to
> make sure they can cope better
> with more charge/uncharge operations, because we are not going back to
> gigantic per-socket reserves,
> this simply does not scale.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  4:13         ` Eric Dumazet
  2022-06-24  4:22           ` Eric Dumazet
@ 2022-06-24  5:13           ` Feng Tang
  2022-06-24  5:45             ` Eric Dumazet
  2022-06-24  6:34           ` Shakeel Butt
  2 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-24  5:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner,
	kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin,
	Ying Xu

Hi Eric,

On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote:
> On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > using CPU the most more than before, and mem_cgroup APIs are also
> > > called in this function. It means the mem cgroup must be enabled in
> > > the test env, which may explain why I couldn't reproduce it.
> > >
> > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > reclaim the memory, which is *more frequent* to call
> > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > cheap when mem_cgroup_sockets_enabled is true.
> > >
> > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > for SCTP netperf testing, which is more sensitive to the extra
> > > function calls than TCP.
> > >
> > > Can we re-run this testing without mem cgroup enabled?
> >
> > FWIW I defer to Eric, thanks a lot for double checking the report
> > and digging in!
> 
> I did tests with TCP + memcg and noticed a very small additional cost
> in memcg functions,
> because of suboptimal layout:
> 
> Extract of an internal Google bug, update from June 9th:
> 
> --------------------------------
> I have noticed a minor false sharing to fetch (struct
> mem_cgroup)->css.parent, at offset 0xc0,
> because it shares the cache line containing struct mem_cgroup.memory,
> at offset 0xd0
> 
> Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> mostly cache line.
> -----------------------
> 
> But nothing that could explain a "-69.4% regression"
 
We can double check that. 

> memcg has a very similar strategy of per-cpu reserves, with
> MEMCG_CHARGE_BATCH being 32 pages per cpu.
 
We have proposed a patch to increase the batch number for stats
updates, but it was not accepted, as it hurts accuracy and the data is
used by many tools.

> It is not clear why SCTP with 10K writes would overflow this reserve constantly.
> 
> Presumably memcg experts will have to rework structure alignments to
> make sure they can cope better
> with more charge/uncharge operations, because we are not going back to
> gigantic per-socket reserves,
> this simply does not scale.

Yes, the memcg statistics and charge/uncharge updates are very
sensitive to the data alignment/layout, and can easily trigger
performance changes, as we've seen in quite a few similar cases over
the past several years.

One pattern we've seen is that even if a memcg stats-updating or charge
function only takes about 2%~3% of the CPU cycles in the perf-profile
data, once it gets affected the performance change can be amplified to
60% or more.

Thanks,
Feng



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  5:13           ` Feng Tang
@ 2022-06-24  5:45             ` Eric Dumazet
  2022-06-24  6:00               ` Feng Tang
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2022-06-24  5:45 UTC (permalink / raw)
  To: Feng Tang
  Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner,
	kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin,
	Ying Xu

On Fri, Jun 24, 2022 at 7:14 AM Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Eric,
>
> On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote:
> > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > > using CPU the most more than before, and mem_cgroup APIs are also
> > > > called in this function. It means the mem cgroup must be enabled in
> > > > the test env, which may explain why I couldn't reproduce it.
> > > >
> > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > > reclaim the memory, which is *more frequent* to call
> > > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > > cheap when mem_cgroup_sockets_enabled is true.
> > > >
> > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > > for SCTP netperf testing, which is more sensitive to the extra
> > > > function calls than TCP.
> > > >
> > > > Can we re-run this testing without mem cgroup enabled?
> > >
> > > FWIW I defer to Eric, thanks a lot for double checking the report
> > > and digging in!
> >
> > I did tests with TCP + memcg and noticed a very small additional cost
> > in memcg functions,
> > because of suboptimal layout:
> >
> > Extract of an internal Google bug, update from June 9th:
> >
> > --------------------------------
> > I have noticed a minor false sharing to fetch (struct
> > mem_cgroup)->css.parent, at offset 0xc0,
> > because it shares the cache line containing struct mem_cgroup.memory,
> > at offset 0xd0
> >
> > Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> > mostly cache line.
> > -----------------------
> >
> > But nothing that could explain a "-69.4% regression"
>
> We can double check that.
>
> > memcg has a very similar strategy of per-cpu reserves, with
> > MEMCG_CHARGE_BATCH being 32 pages per cpu.
>
> We have proposed patch to increase the batch numer for stats
> update, which was not accepted as it hurts the accuracy and
> the data is used by many tools.
>
> > It is not clear why SCTP with 10K writes would overflow this reserve constantly.
> >
> > Presumably memcg experts will have to rework structure alignments to
> > make sure they can cope better
> > with more charge/uncharge operations, because we are not going back to
> > gigantic per-socket reserves,
> > this simply does not scale.
>
> Yes, the memcg statitics and charge/unchage update is very sensitive
> with the data alignemnt layout, and can easily trigger peformance
> changes, as we've seen quite some similar cases in the past several
> years.
>
> One pattern we've seen is, even if a memcg stats updating or charge
> function only takes about 2%~3% of the CPU cycles in perf-profile data,
> once it got affected, the peformance change could be amplified to up to
> 60% or more.
>

Reorganizing "struct mem_cgroup" to put "struct page_counter memory"
in a separate cache line would be beneficial.

Many low hanging fruits, assuming nobody will use __randomize_layout on it ;)

Also some fields are written even if their value is not changed.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
                struct page_counter *fail;

                if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
-                       memcg->tcpmem_pressure = 0;
+                       if (READ_ONCE(memcg->tcpmem_pressure))
+                               WRITE_ONCE(memcg->tcpmem_pressure, 0);
                        return true;
                }
-               memcg->tcpmem_pressure = 1;
+               if (!READ_ONCE(memcg->tcpmem_pressure))
+                       WRITE_ONCE(memcg->tcpmem_pressure, 1);
                if (gfp_mask & __GFP_NOFAIL) {
                        page_counter_charge(&memcg->tcpmem, nr_pages);
                        return true;
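
One possible shape of the reorganization suggested above, purely
illustrative (the field names are from struct mem_cgroup, but the
actual grouping would have to be validated with perf):

struct mem_cgroup {
	struct cgroup_subsys_state css;

	/* ... */

	/* read-mostly on the charge path; keep away from hot counters */
	unsigned long socket_pressure;

	/* hot, frequently written counters on their own cache line(s),
	 * so they stop false-sharing with css.parent and friends */
	struct page_counter memory ____cacheline_aligned_in_smp;
	struct page_counter swap;
	struct page_counter kmem;
	struct page_counter tcpmem;

	/* ... */
};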

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  5:45             ` Eric Dumazet
@ 2022-06-24  6:00               ` Feng Tang
  2022-06-24  6:07                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-24  6:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner,
	kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin,
	Ying Xu

On Fri, Jun 24, 2022 at 07:45:00AM +0200, Eric Dumazet wrote:
> On Fri, Jun 24, 2022 at 7:14 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Hi Eric,
> >
> > On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote:
> > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > > > using CPU the most more than before, and mem_cgroup APIs are also
> > > > > called in this function. It means the mem cgroup must be enabled in
> > > > > the test env, which may explain why I couldn't reproduce it.
> > > > >
> > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > > > reclaim the memory, which is *more frequent* to call
> > > > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > > > cheap when mem_cgroup_sockets_enabled is true.
> > > > >
> > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > > > for SCTP netperf testing, which is more sensitive to the extra
> > > > > function calls than TCP.
> > > > >
> > > > > Can we re-run this testing without mem cgroup enabled?
> > > >
> > > > FWIW I defer to Eric, thanks a lot for double checking the report
> > > > and digging in!
> > >
> > > I did tests with TCP + memcg and noticed a very small additional cost
> > > in memcg functions,
> > > because of suboptimal layout:
> > >
> > > Extract of an internal Google bug, update from June 9th:
> > >
> > > --------------------------------
> > > I have noticed a minor false sharing to fetch (struct
> > > mem_cgroup)->css.parent, at offset 0xc0,
> > > because it shares the cache line containing struct mem_cgroup.memory,
> > > at offset 0xd0
> > >
> > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> > > mostly cache line.
> > > -----------------------
> > >
> > > But nothing that could explain a "-69.4% regression"
> >
> > We can double check that.
> >
> > > memcg has a very similar strategy of per-cpu reserves, with
> > > MEMCG_CHARGE_BATCH being 32 pages per cpu.
> >
> > We have proposed patch to increase the batch numer for stats
> > update, which was not accepted as it hurts the accuracy and
> > the data is used by many tools.
> >
> > > It is not clear why SCTP with 10K writes would overflow this reserve constantly.
> > >
> > > Presumably memcg experts will have to rework structure alignments to
> > > make sure they can cope better
> > > with more charge/uncharge operations, because we are not going back to
> > > gigantic per-socket reserves,
> > > this simply does not scale.
> >
> > Yes, the memcg statitics and charge/unchage update is very sensitive
> > with the data alignemnt layout, and can easily trigger peformance
> > changes, as we've seen quite some similar cases in the past several
> > years.
> >
> > One pattern we've seen is, even if a memcg stats updating or charge
> > function only takes about 2%~3% of the CPU cycles in perf-profile data,
> > once it got affected, the peformance change could be amplified to up to
> > 60% or more.
> >
> 
> Reorganizing "struct mem_cgroup" to put "struct page_counter memory"
> in a separate cache line would be beneficial.
 
That may help.

And I also want to say that the benchmarks (especially the micro ones)
are very sensitive to the layout of struct mem_cgroup. As
'page_counter' is 112 bytes in size, I recently made a patch to make it
cacheline aligned (taking 2 cachelines), which improved some
hackbench/netperf test cases but caused a huge (49%) drop for some
vm-scalability tests.

> Many low hanging fruits, assuming nobody will use __randomize_layout on it ;)
> 
> Also some fields are written even if their value is not changed.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74
> 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup
> *memcg, unsigned int nr_pages,
>                 struct page_counter *fail;
> 
>                 if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> -                       memcg->tcpmem_pressure = 0;
> +                       if (READ_ONCE(memcg->tcpmem_pressure))
> +                               WRITE_ONCE(memcg->tcpmem_pressure, 0);
>                         return true;
>                 }
> -               memcg->tcpmem_pressure = 1;
> +               if (!READ_ONCE(memcg->tcpmem_pressure))
> +                       WRITE_ONCE(memcg->tcpmem_pressure, 1);
>                 if (gfp_mask & __GFP_NOFAIL) {
>                         page_counter_charge(&memcg->tcpmem, nr_pages);
>                         return true;

I will also try this patch, which may take some time.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  6:00               ` Feng Tang
@ 2022-06-24  6:07                 ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2022-06-24  6:07 UTC (permalink / raw)
  To: Feng Tang
  Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner,
	kernel test robot, Shakeel Butt, Soheil Hassas Yeganeh, LKML,
	Linux Memory Management List, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, zhengjun.xing, fengwei.yin,
	Ying Xu

On Fri, Jun 24, 2022 at 8:01 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Fri, Jun 24, 2022 at 07:45:00AM +0200, Eric Dumazet wrote:
> > On Fri, Jun 24, 2022 at 7:14 AM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > Hi Eric,
> > >
> > > On Fri, Jun 24, 2022 at 06:13:51AM +0200, Eric Dumazet wrote:
> > > > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > >
> > > > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > > > > using CPU the most more than before, and mem_cgroup APIs are also
> > > > > > called in this function. It means the mem cgroup must be enabled in
> > > > > > the test env, which may explain why I couldn't reproduce it.
> > > > > >
> > > > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > > > > reclaim the memory, which is *more frequent* to call
> > > > > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > > > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > > > > cheap when mem_cgroup_sockets_enabled is true.
> > > > > >
> > > > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > > > > for SCTP netperf testing, which is more sensitive to the extra
> > > > > > function calls than TCP.
> > > > > >
> > > > > > Can we re-run this testing without mem cgroup enabled?
> > > > >
> > > > > FWIW I defer to Eric, thanks a lot for double checking the report
> > > > > and digging in!
> > > >
> > > > I did tests with TCP + memcg and noticed a very small additional cost
> > > > in memcg functions,
> > > > because of suboptimal layout:
> > > >
> > > > Extract of an internal Google bug, update from June 9th:
> > > >
> > > > --------------------------------
> > > > I have noticed a minor false sharing to fetch (struct
> > > > mem_cgroup)->css.parent, at offset 0xc0,
> > > > because it shares the cache line containing struct mem_cgroup.memory,
> > > > at offset 0xd0
> > > >
> > > > Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> > > > mostly cache line.
> > > > -----------------------
> > > >
> > > > But nothing that could explain a "-69.4% regression"
> > >
> > > We can double check that.
> > >
> > > > memcg has a very similar strategy of per-cpu reserves, with
> > > > MEMCG_CHARGE_BATCH being 32 pages per cpu.
> > >
> > > We have proposed patch to increase the batch numer for stats
> > > update, which was not accepted as it hurts the accuracy and
> > > the data is used by many tools.
> > >
> > > > It is not clear why SCTP with 10K writes would overflow this reserve constantly.
> > > >
> > > > Presumably memcg experts will have to rework structure alignments to
> > > > make sure they can cope better
> > > > with more charge/uncharge operations, because we are not going back to
> > > > gigantic per-socket reserves,
> > > > this simply does not scale.
> > >
> > > Yes, the memcg statitics and charge/unchage update is very sensitive
> > > with the data alignemnt layout, and can easily trigger peformance
> > > changes, as we've seen quite some similar cases in the past several
> > > years.
> > >
> > > One pattern we've seen is, even if a memcg stats updating or charge
> > > function only takes about 2%~3% of the CPU cycles in perf-profile data,
> > > once it got affected, the peformance change could be amplified to up to
> > > 60% or more.
> > >
> >
> > Reorganizing "struct mem_cgroup" to put "struct page_counter memory"
> > in a separate cache line would be beneficial.
>
> That may help.
>
> And I also want to say the benchmarks(especially micro one) are very
> sensitive to the layout of mem_cgroup. As the 'page_counter' is 112
> bytes in size, I recently made a patch to make it cacheline aligned
> (take 2 cachelines), which improved some hackbench/netperf test
> cases, but caused huge (49%) drop for some vm-scalability tests.
>
> > Many low hanging fruits, assuming nobody will use __randomize_layout on it ;)
> >
> > Also some fields are written even if their value is not changed.
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index abec50f31fe64100f4be5b029c7161b3a6077a74..53d9c1e581e78303ef73942e2b34338567987b74
> > 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7037,10 +7037,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup
> > *memcg, unsigned int nr_pages,
> >                 struct page_counter *fail;
> >
> >                 if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> > -                       memcg->tcpmem_pressure = 0;
> > +                       if (READ_ONCE(memcg->tcpmem_pressure))
> > +                               WRITE_ONCE(memcg->tcpmem_pressure, 0);
> >                         return true;
> >                 }
> > -               memcg->tcpmem_pressure = 1;
> > +               if (!READ_ONCE(memcg->tcpmem_pressure))
> > +                       WRITE_ONCE(memcg->tcpmem_pressure, 1);
> >                 if (gfp_mask & __GFP_NOFAIL) {
> >                         page_counter_charge(&memcg->tcpmem, nr_pages);
> >                         return true;
>
> I will also try this patch, which may take some time.

Note that applications can opt-in reserving memory for one socket,
using SO_RESERVE_MEM

This can be used for jobs with a controlled number of sockets, as this
will avoid many charge/uncharge operations.
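
A minimal user-space sketch of that opt-in (assuming a kernel with
SO_RESERVE_MEM support; the constant may not be in older libc headers,
hence the fallback define):

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_RESERVE_MEM
#define SO_RESERVE_MEM 73	/* include/uapi/asm-generic/socket.h */
#endif

/* Pre-charge 'bytes' of forward-alloc memory to this socket so the
 * data path can mostly avoid charge/uncharge operations.  IIRC the
 * kernel only honors this for sockets in a non-root memcg.
 */
static int reserve_sock_mem(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
		       &bytes, sizeof(bytes)) < 0) {
		perror("setsockopt(SO_RESERVE_MEM)");
		return -1;
	}
	return 0;
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0 || reserve_sock_mem(fd, 1 << 20))	/* 1MB reserve */
		return 1;
	printf("reserved socket memory on fd %d\n", fd);
	return 0;
}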

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  4:13         ` Eric Dumazet
  2022-06-24  4:22           ` Eric Dumazet
  2022-06-24  5:13           ` Feng Tang
@ 2022-06-24  6:34           ` Shakeel Butt
  2022-06-24  7:06             ` Feng Tang
  2 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2022-06-24  6:34 UTC (permalink / raw)
  To: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song
  Cc: Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner,
	kernel test robot, Soheil Hassas Yeganeh, LKML, network dev,
	linux-s390, MPTCP Upstream, linux-sctp @ vger . kernel . org,
	lkp, kbuild test robot, Huang Ying, Tang, Feng, Xing Zhengjun,
	Yin Fengwei, Ying Xu

CCing memcg folks.

The thread starts at
https://lore.kernel.org/all/20220619150456.GB34471@xsang-OptiPlex-9020/

On Thu, Jun 23, 2022 at 9:14 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > using CPU the most more than before, and mem_cgroup APIs are also
> > > called in this function. It means the mem cgroup must be enabled in
> > > the test env, which may explain why I couldn't reproduce it.
> > >
> > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > reclaim the memory, which is *more frequent* to call
> > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > cheap when mem_cgroup_sockets_enabled is true.
> > >
> > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > for SCTP netperf testing, which is more sensitive to the extra
> > > function calls than TCP.
> > >
> > > Can we re-run this testing without mem cgroup enabled?
> >
> > FWIW I defer to Eric, thanks a lot for double checking the report
> > and digging in!
>
> I did tests with TCP + memcg and noticed a very small additional cost
> in memcg functions,
> because of suboptimal layout:
>
> Extract of an internal Google bug, update from June 9th:
>
> --------------------------------
> I have noticed a minor false sharing to fetch (struct
> mem_cgroup)->css.parent, at offset 0xc0,
> because it shares the cache line containing struct mem_cgroup.memory,
> at offset 0xd0
>
> Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> mostly cache line.
> -----------------------
>
> But nothing that could explain a "-69.4% regression"
>
> memcg has a very similar strategy of per-cpu reserves, with
> MEMCG_CHARGE_BATCH being 32 pages per cpu.
>
> It is not clear why SCTP with 10K writes would overflow this reserve constantly.
>
> Presumably memcg experts will have to rework structure alignments to
> make sure they can cope better
> with more charge/uncharge operations, because we are not going back to
> gigantic per-socket reserves,
> this simply does not scale.

Yes, I agree. As you pointed out, there are fields which are mostly
read-only but share cache lines with fields which get updated; that
definitely needs work.

However, can we first confirm whether memcg charging is really the
issue here? As I remember, these Intel LKP tests are configured to run
in the root memcg, and the kernel does not associate the root memcg
with any socket (see mem_cgroup_sk_alloc()).
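
For reference, mem_cgroup_sk_alloc() looks roughly like this (a
condensed paraphrase of mm/memcontrol.c), which is why sockets created
in the root memcg skip the accounting entirely:

void mem_cgroup_sk_alloc(struct sock *sk)
{
	struct mem_cgroup *memcg;

	if (!mem_cgroup_sockets_enabled)
		return;

	/* do not associate the sock with an interrupted task's memcg */
	if (!in_task())
		return;

	rcu_read_lock();
	memcg = mem_cgroup_from_task(current);
	if (memcg == root_mem_cgroup)
		goto out;	/* root memcg: sk->sk_memcg stays NULL */
	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
		goto out;	/* cgroup v1 without tcp accounting enabled */
	if (css_tryget(&memcg->css))
		sk->sk_memcg = memcg;
out:
	rcu_read_unlock();
}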

If these tests are running in a non-root memcg, is this cgroup v1 or
v2? The memory counter and the 32-pages-per-cpu stock are only used on
v2. For v1, there is no per-cpu stock and there is a separate tcpmem
page counter; also, on v1 the network memory accounting has to be
enabled explicitly, i.e. it is not enabled by default.

There is a definite possibility of a slowdown on v1, but let's first
confirm the memcg setup used for this testing environment.

Feng, can you please explain the memcg setup on these test machines
and if the tests are run in root or non-root memcg?

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  6:34           ` Shakeel Butt
@ 2022-06-24  7:06             ` Feng Tang
  2022-06-24 14:43               ` Shakeel Butt
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-24  7:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote:
> CCing memcg folks.
> 
> The thread starts at
> https://lore.kernel.org/all/20220619150456.GB34471@xsang-OptiPlex-9020/
> 
> On Thu, Jun 23, 2022 at 9:14 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 3:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Thu, 23 Jun 2022 18:50:07 -0400 Xin Long wrote:
> > > > From the perf data, we can see __sk_mem_reduce_allocated() is the one
> > > > using CPU the most more than before, and mem_cgroup APIs are also
> > > > called in this function. It means the mem cgroup must be enabled in
> > > > the test env, which may explain why I couldn't reproduce it.
> > > >
> > > > The Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
> > > > possible") uses sk_mem_reclaim(checking reclaimable >= PAGE_SIZE) to
> > > > reclaim the memory, which is *more frequent* to call
> > > > __sk_mem_reduce_allocated() than before (checking reclaimable >=
> > > > SK_RECLAIM_THRESHOLD). It might be cheap when
> > > > mem_cgroup_sockets_enabled is false, but I'm not sure if it's still
> > > > cheap when mem_cgroup_sockets_enabled is true.
> > > >
> > > > I think SCTP netperf could trigger this, as the CPU is the bottleneck
> > > > for SCTP netperf testing, which is more sensitive to the extra
> > > > function calls than TCP.
> > > >
> > > > Can we re-run this testing without mem cgroup enabled?
> > >
> > > FWIW I defer to Eric, thanks a lot for double checking the report
> > > and digging in!
> >
> > I did tests with TCP + memcg and noticed a very small additional cost
> > in memcg functions,
> > because of suboptimal layout:
> >
> > Extract of an internal Google bug, update from June 9th:
> >
> > --------------------------------
> > I have noticed a minor false sharing to fetch (struct
> > mem_cgroup)->css.parent, at offset 0xc0,
> > because it shares the cache line containing struct mem_cgroup.memory,
> > at offset 0xd0
> >
> > Ideally, memcg->socket_pressure and memcg->parent should sit in a read
> > mostly cache line.
> > -----------------------
> >
> > But nothing that could explain a "-69.4% regression"
> >
> > memcg has a very similar strategy of per-cpu reserves, with
> > MEMCG_CHARGE_BATCH being 32 pages per cpu.
> >
> > It is not clear why SCTP with 10K writes would overflow this reserve constantly.
> >
> > Presumably memcg experts will have to rework structure alignments to
> > make sure they can cope better
> > with more charge/uncharge operations, because we are not going back to
> > gigantic per-socket reserves,
> > this simply does not scale.
> 
> Yes I agree. As you pointed out there are fields which are mostly
> read-only but sharing cache lines with fields which get updated and
> definitely need work.
> 
> However can we first confirm if memcg charging is really the issue
> here as I remember these intel lkp tests are configured to run in root
> memcg and the kernel does not associate root memcg to any socket (see
> mem_cgroup_sk_alloc()).
> 
> If these tests are running in non-root memcg, is this cgroup v1 or v2?
> The memory counter and the 32 pages per cpu stock are only used on v2.
> For v1, there is no per-cpu stock and there is a separate tcpmem page
> counter and on v1 the network memory accounting has to be enabled
> explicitly i.e. not enabled by default.
> 
> There is definite possibility of slowdown on v1 but let's first
> confirm the memcg setup used for this testing environment.
> 
> Feng, can you please explain the memcg setup on these test machines
> and if the tests are run in root or non-root memcg?

I don't know the exact setup; Philip/Oliver from 0Day can correct me.

I logged into a test box which runs the netperf test, and it seems to
be cgroup v1 and a non-root memcg. The netperf tasks all sit in this
dir: '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service'

And the rootfs is a debian based rootfs

Thanks,
Feng


> thanks,
> Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24  7:06             ` Feng Tang
@ 2022-06-24 14:43               ` Shakeel Butt
  2022-06-25  2:36                 ` Feng Tang
  0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2022-06-24 14:43 UTC (permalink / raw)
  To: Feng Tang
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote:
> On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote:
[...]
> > 
> > Feng, can you please explain the memcg setup on these test machines
> > and if the tests are run in root or non-root memcg?
> 
> I don't know the exact setup, Philip/Oliver from 0Day can correct me.
> 
> I logged into a test box which runs netperf test, and it seems to be
> cgoup v1 and non-root memcg. The netperf tasks all sit in dir:
> '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service'
> 

Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
sure that the netperf test has already run?

If this is non-zero then network memory accounting is enabled and the
slowdown is expected.

> And the rootfs is a debian based rootfs
> 
> Thanks,
> Feng
> 
> 
> > thanks,
> > Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-24 14:43               ` Shakeel Butt
@ 2022-06-25  2:36                 ` Feng Tang
  2022-06-27  2:38                   ` Feng Tang
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-25  2:36 UTC (permalink / raw)
  To: Shakeel Butt, Eric Dumazet
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Fri, Jun 24, 2022 at 02:43:58PM +0000, Shakeel Butt wrote:
> On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote:
> > On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote:
> [...]
> > > 
> > > Feng, can you please explain the memcg setup on these test machines
> > > and if the tests are run in root or non-root memcg?
> > 
> > I don't know the exact setup, Philip/Oliver from 0Day can correct me.
> > 
> > I logged into a test box which runs netperf test, and it seems to be
> > cgoup v1 and non-root memcg. The netperf tasks all sit in dir:
> > '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service'
> > 
> 
> Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> sure that the netperf test has already run?

memory.kmem.tcp.max_usage_in_bytes:0

And here is more memcg stats (let me know if you want to check more)

/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service# grep . memory.*
memory.failcnt:0
memory.kmem.failcnt:0
memory.kmem.limit_in_bytes:9223372036854771712
memory.kmem.max_usage_in_bytes:47861760
memory.kmem.tcp.failcnt:0
memory.kmem.tcp.limit_in_bytes:9223372036854771712
memory.kmem.tcp.max_usage_in_bytes:0
memory.kmem.tcp.usage_in_bytes:0
memory.kmem.usage_in_bytes:40730624
memory.limit_in_bytes:9223372036854771712
memory.max_usage_in_bytes:642424832
memory.memsw.failcnt:0
memory.memsw.limit_in_bytes:9223372036854771712
memory.memsw.max_usage_in_bytes:642424832
memory.memsw.usage_in_bytes:639549440
memory.move_charge_at_immigrate:0
memory.numa_stat:total=144073 N0=124819 N1=19254
memory.numa_stat:file=0 N0=0 N1=0
memory.numa_stat:anon=77721 N0=58502 N1=19219
memory.numa_stat:unevictable=66352 N0=66317 N1=35
memory.numa_stat:hierarchical_total=144073 N0=124819 N1=19254
memory.numa_stat:hierarchical_file=0 N0=0 N1=0
memory.numa_stat:hierarchical_anon=77721 N0=58502 N1=19219
memory.numa_stat:hierarchical_unevictable=66352 N0=66317 N1=35
memory.oom_control:oom_kill_disable 0
memory.oom_control:under_oom 0
memory.oom_control:oom_kill 0
grep: memory.pressure_level: Invalid argument
memory.soft_limit_in_bytes:9223372036854771712
memory.stat:cache 282562560
memory.stat:rss 307884032
memory.stat:rss_huge 239075328
memory.stat:shmem 10784768
memory.stat:mapped_file 3444736
memory.stat:dirty 0
memory.stat:writeback 0
memory.stat:swap 0
memory.stat:pgpgin 1018918
memory.stat:pgpgout 932902
memory.stat:pgfault 2130513
memory.stat:pgmajfault 0
memory.stat:inactive_anon 310272000
memory.stat:active_anon 8073216
memory.stat:inactive_file 0
memory.stat:active_file 0
memory.stat:unevictable 271777792
memory.stat:hierarchical_memory_limit 9223372036854771712
memory.stat:hierarchical_memsw_limit 9223372036854771712
memory.stat:total_cache 282562560
memory.stat:total_rss 307884032
memory.stat:total_rss_huge 239075328
memory.stat:total_shmem 10784768
memory.stat:total_mapped_file 3444736
memory.stat:total_dirty 0
memory.stat:total_writeback 0
memory.stat:total_swap 0
memory.stat:total_pgpgin 1018918
memory.stat:total_pgpgout 932902
memory.stat:total_pgfault 2130513
memory.stat:total_pgmajfault 0
memory.stat:total_inactive_anon 310272000
memory.stat:total_active_anon 8073216
memory.stat:total_inactive_file 0
memory.stat:total_active_file 0
memory.stat:total_unevictable 271777792
memory.swappiness:60
memory.usage_in_bytes:639549440
memory.use_hierarchy:1

> If this is non-zero then network memory accounting is enabled and the
> slowdown is expected.

From the perf-profile data in the original report, both
__sk_mem_raise_allocated() and __sk_mem_reduce_allocated() are called
much more often, and they call the memcg charge/uncharge functions.

IIUC, the call chain is:

__sk_mem_raise_allocated
    sk_memory_allocated_add
    mem_cgroup_charge_skmem
        charge memcg->tcpmem page counter (for cgroup v1)
        try_charge memcg (for cgroup v2)
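
For reference, the v1/v2 split in mem_cgroup_charge_skmem() looks
roughly like this (a condensed paraphrase of mm/memcontrol.c):

bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
			     gfp_t gfp_mask)
{
	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
		/* cgroup v1: charge the dedicated tcpmem page counter */
		struct page_counter *fail;

		if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
			memcg->tcpmem_pressure = 0;
			return true;
		}
		memcg->tcpmem_pressure = 1;
		if (gfp_mask & __GFP_NOFAIL) {
			page_counter_charge(&memcg->tcpmem, nr_pages);
			return true;
		}
		return false;
	}

	/* cgroup v2: charge the unified memory counter, going through
	 * the per-cpu stock and page_counter in try_charge() */
	if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
		mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
		return true;
	}
	return false;
}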

Also, from one of Eric's earlier commit logs:

"
net: implement per-cpu reserves for memory_allocated
...
This means we are going to call sk_memory_allocated_add()
and sk_memory_allocated_sub() more often.
...
"

So is this slowdown related to the more frequent charge/uncharge calls?

Thanks,
Feng

> > And the rootfs is a debian based rootfs
> > 
> > Thanks,
> > Feng
> > 
> > 
> > > thanks,
> > > Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-25  2:36                 ` Feng Tang
@ 2022-06-27  2:38                   ` Feng Tang
  2022-06-27  8:46                     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-27  2:38 UTC (permalink / raw)
  To: Shakeel Butt, Eric Dumazet
  Cc: Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
	Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Sat, Jun 25, 2022 at 10:36:42AM +0800, Feng Tang wrote:
> On Fri, Jun 24, 2022 at 02:43:58PM +0000, Shakeel Butt wrote:
> > On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote:
> > > On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote:
> > [...]
> > > > 
> > > > Feng, can you please explain the memcg setup on these test machines
> > > > and if the tests are run in root or non-root memcg?
> > > 
> > > I don't know the exact setup, Philip/Oliver from 0Day can correct me.
> > > 
> > > I logged into a test box which runs netperf test, and it seems to be
> > > cgoup v1 and non-root memcg. The netperf tasks all sit in dir:
> > > '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service'
> > > 
> > 
> > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > sure that the netperf test has already run?
> 
> memory.kmem.tcp.max_usage_in_bytes:0
 
Sorry, I made a mistake: in the original report from Oliver, it was
'cgroup v2' with a 'debian-11.1' rootfs.

When you asked about the cgroup info, I tried the job on another tbox,
and the original 'job.yaml' didn't work, so I kept the 'netperf' test
parameters and started a new job which somehow ran with a 'debian-10.4'
rootfs and actually ran with cgroup v1.

And as you mentioned, the cgroup version does make a big difference:
with v1, the regression is reduced to 1% ~ 5% on different generations
of test platforms. Eric mentioned they also got a regression report,
but a much smaller one; maybe it's due to the cgroup version?

Thanks,
Feng

> And here is more memcg stats (let me know if you want to check more)
> 
> > If this is non-zero then network memory accounting is enabled and the
> > slowdown is expected.
> 
> >From the perf-profile data in original report, both
> __sk_mem_raise_allocated() and __sk_mem_reduce_allocated() are called
> much more often, which call memcg charge/uncharge functions.
> 
> IIUC, the call chain is:
> 
> __sk_mem_raise_allocated
>     sk_memory_allocated_add
>     mem_cgroup_charge_skmem
>         charge memcg->tcpmem (for cgroup v2)
> 	try_charge memcg (for v1)
> 
> Also from Eric's one earlier commit log:
> 
> "
> net: implement per-cpu reserves for memory_allocated
> ...
> This means we are going to call sk_memory_allocated_add()
> and sk_memory_allocated_sub() more often.
> ...
> "
> 
> So this slowdown is related to the more calling of charge/uncharge? 
> 
> Thanks,
> Feng
> 
> > > And the rootfs is a debian based rootfs
> > > 
> > > Thanks,
> > > Feng
> > > 
> > > 
> > > > thanks,
> > > > Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27  2:38                   ` Feng Tang
@ 2022-06-27  8:46                     ` Eric Dumazet
  2022-06-27 12:34                       ` Feng Tang
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2022-06-27  8:46 UTC (permalink / raw)
  To: Feng Tang
  Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Sat, Jun 25, 2022 at 10:36:42AM +0800, Feng Tang wrote:
> > On Fri, Jun 24, 2022 at 02:43:58PM +0000, Shakeel Butt wrote:
> > > On Fri, Jun 24, 2022 at 03:06:56PM +0800, Feng Tang wrote:
> > > > On Thu, Jun 23, 2022 at 11:34:15PM -0700, Shakeel Butt wrote:
> > > [...]
> > > > >
> > > > > Feng, can you please explain the memcg setup on these test machines
> > > > > and if the tests are run in root or non-root memcg?
> > > >
> > > > I don't know the exact setup, Philip/Oliver from 0Day can correct me.
> > > >
> > > > I logged into a test box which runs netperf test, and it seems to be
> > > > cgoup v1 and non-root memcg. The netperf tasks all sit in dir:
> > > > '/sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service'
> > > >
> > >
> > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > sure that the netperf test has already run?
> >
> > memory.kmem.tcp.max_usage_in_bytes:0
>
> Sorry, I made a mistake that in the original report from Oliver, it
> was 'cgroup v2' with a 'debian-11.1' rootfs.
>
> When you asked about cgroup info, I tried the job on another tbox, and
> the original 'job.yaml' didn't work, so I kept the 'netperf' test
> parameters and started a new job which somehow run with a 'debian-10.4'
> rootfs and acutally run with cgroup v1.
>
> And as you mentioned cgroup version does make a big difference, that
> with v1, the regression is reduced to 1% ~ 5% on different generations
> of test platforms. Eric mentioned they also got regression report,
> but much smaller one, maybe it's due to the cgroup version?

This was using the current net-next tree.
The recipe used was something like:

Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
Create a cgroup by mkdir $MOUNT_POINT/job.
Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.

<Launch tests>

The regression was smaller than 1%, so considered noise compared to
the benefits of the bug fix.

>
> Thanks,
> Feng
>
> > And here is more memcg stats (let me know if you want to check more)
> >
> > > If this is non-zero then network memory accounting is enabled and the
> > > slowdown is expected.
> >
> > >From the perf-profile data in original report, both
> > __sk_mem_raise_allocated() and __sk_mem_reduce_allocated() are called
> > much more often, which call memcg charge/uncharge functions.
> >
> > IIUC, the call chain is:
> >
> > __sk_mem_raise_allocated
> >     sk_memory_allocated_add
> >     mem_cgroup_charge_skmem
> >         charge memcg->tcpmem (for cgroup v2)
> >       try_charge memcg (for v1)
> >
> > Also from Eric's one earlier commit log:
> >
> > "
> > net: implement per-cpu reserves for memory_allocated
> > ...
> > This means we are going to call sk_memory_allocated_add()
> > and sk_memory_allocated_sub() more often.
> > ...
> > "
> >
> > So this slowdown is related to the more calling of charge/uncharge?
> >
> > Thanks,
> > Feng
> >
> > > > And the rootfs is a debian based rootfs
> > > >
> > > > Thanks,
> > > > Feng
> > > >
> > > >
> > > > > thanks,
> > > > > Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27  8:46                     ` Eric Dumazet
@ 2022-06-27 12:34                       ` Feng Tang
  2022-06-27 14:07                         ` Eric Dumazet
  2022-06-27 14:52                         ` Shakeel Butt
  0 siblings, 2 replies; 32+ messages in thread
From: Feng Tang @ 2022-06-27 12:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote:
[snip]
> > > >
> > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > sure that the netperf test has already run?
> > >
> > > memory.kmem.tcp.max_usage_in_bytes:0
> >
> > Sorry, I made a mistake that in the original report from Oliver, it
> > was 'cgroup v2' with a 'debian-11.1' rootfs.
> >
> > When you asked about cgroup info, I tried the job on another tbox, and
> > the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > parameters and started a new job which somehow run with a 'debian-10.4'
> > rootfs and acutally run with cgroup v1.
> >
> > And as you mentioned cgroup version does make a big difference, that
> > with v1, the regression is reduced to 1% ~ 5% on different generations
> > of test platforms. Eric mentioned they also got regression report,
> > but much smaller one, maybe it's due to the cgroup version?
> 
> This was using the current net-next tree.
> Used recipe was something like:
> 
> Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
> Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> Create a cgroup by mkdir $MOUNT_POINT/job.
> Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.
> 
> <Launch tests>
> 
> The regression was smaller than 1%, so considered noise compared to
> the benefits of the bug fix.
 
Yes, 1% is just around noise level for a microbenchmark.

I went to check the original test data of Oliver's report; the tests were
run for 6 rounds and the performance data is pretty stable (0Day's report
will show any std deviation bigger than 2%).

The test platform is a 4-socket 72C/144T machine, and I ran the
same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
and one Icelake 2-socket platform, and saw 75% and 53% regressions on
them.

In the first email, there is a file named 'reproduce', it shows the
basic test process:

"
  use 'performance' cpufreq governor for all CPUs

  netserver -4 -D
  modprobe sctp
  netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
  netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
  netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
  (repeat 36 times in total) 
  ...

"

This starts 36 (25% of nr_cpus) netperf clients. The number of clients
also matters: when I increased it from 36 to 72 (50%), the regression
changed from 69.4% to 73.7%.

Thanks,
Feng

> >
> > Thanks,
> > Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 12:34                       ` Feng Tang
@ 2022-06-27 14:07                         ` Eric Dumazet
  2022-06-27 14:48                           ` Feng Tang
  2022-06-27 14:52                         ` Shakeel Butt
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2022-06-27 14:07 UTC (permalink / raw)
  To: Feng Tang
  Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 2:34 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote:
> [snip]
> > > > >
> > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > > sure that the netperf test has already run?
> > > >
> > > > memory.kmem.tcp.max_usage_in_bytes:0
> > >
> > > Sorry, I made a mistake that in the original report from Oliver, it
> > > was 'cgroup v2' with a 'debian-11.1' rootfs.
> > >
> > > When you asked about cgroup info, I tried the job on another tbox, and
> > > the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > > parameters and started a new job which somehow run with a 'debian-10.4'
> > > rootfs and acutally run with cgroup v1.
> > >
> > > And as you mentioned cgroup version does make a big difference, that
> > > with v1, the regression is reduced to 1% ~ 5% on different generations
> > > of test platforms. Eric mentioned they also got regression report,
> > > but much smaller one, maybe it's due to the cgroup version?
> >
> > This was using the current net-next tree.
> > Used recipe was something like:
> >
> > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
> > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> > Create a cgroup by mkdir $MOUNT_POINT/job.
> > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.
> >
> > <Launch tests>
> >
> > The regression was smaller than 1%, so considered noise compared to
> > the benefits of the bug fix.
>
> Yes, 1% is just around noise level for a microbenchmark.
>
> I went check the original test data of Oliver's report, the tests was
> run 6 rounds and the performance data is pretty stable (0Day's report
> will show any std deviation bigger than 2%)
>
> The test platform is a 4 sockets 72C/144T machine, and I run the
> same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> and one Icelake 2 sockets platform, and saw 75% and 53% regresson on
> them.
>
> In the first email, there is a file named 'reproduce', it shows the
> basic test process:
>
> "
>   use 'performane' cpufre  governor for all CPUs
>
>   netserver -4 -D
>   modprobe sctp
>   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
>   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
>   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
>   (repeat 36 times in total)
>   ...
>
> "
>
> Which starts 36 (25% of nr_cpus) netperf clients. And the clients number
> also matters, I tried to increase the client number from 36 to 72(50%),
> and the regression is changed from 69.4% to 73.7%"
>

This seems like a lot of opportunities for memcg folks :)

struct page_counter has poor field placement [1], and no per-cpu cache.

[1] "atomic_long_t usage" is sharing cache line with read mostly fields.

(struct mem_cgroup also has poor field placement, mainly because of
struct page_counter)

    28.69%  [kernel]       [k] copy_user_enhanced_fast_string
    16.13%  [kernel]       [k] intel_idle_irq
     6.46%  [kernel]       [k] page_counter_try_charge
     6.20%  [kernel]       [k] __sk_mem_reduce_allocated
     5.68%  [kernel]       [k] try_charge_memcg
     5.16%  [kernel]       [k] page_counter_cancel
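
For reference, a rough sketch of the layout in question (field order as
in include/linux/page_counter.h around this time, with several
protection-tracking fields omitted; not the literal header). The
write-hot counter and the read-mostly limits sit in the same cache line,
so every charge/uncharge that updates 'usage' invalidates the line other
CPUs are reading the limits from:

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <asm/page.h>

struct page_counter {
	atomic_long_t usage;		/* written on every charge/uncharge */
	unsigned long min;		/* read-mostly protection/limit values, */
	unsigned long low;		/* read on every hierarchical charge */
	unsigned long high;
	unsigned long max;

	/* (min/low usage-tracking fields omitted) */

	unsigned long watermark;	/* written when usage hits a new peak */
	unsigned long failcnt;		/* written when a charge fails */

	struct page_counter *parent;	/* read-mostly: walked on every charge */
};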



> Thanks,
> Feng
>
> > >
> > > Thanks,
> > > Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 14:07                         ` Eric Dumazet
@ 2022-06-27 14:48                           ` Feng Tang
  2022-06-27 16:25                             ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-27 14:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 04:07:55PM +0200, Eric Dumazet wrote:
> On Mon, Jun 27, 2022 at 2:34 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> > > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote:
> > [snip]
> > > > > >
> > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > > > sure that the netperf test has already run?
> > > > >
> > > > > memory.kmem.tcp.max_usage_in_bytes:0
> > > >
> > > > Sorry, I made a mistake that in the original report from Oliver, it
> > > > was 'cgroup v2' with a 'debian-11.1' rootfs.
> > > >
> > > > When you asked about cgroup info, I tried the job on another tbox, and
> > > > the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > > > parameters and started a new job which somehow run with a 'debian-10.4'
> > > > rootfs and acutally run with cgroup v1.
> > > >
> > > > And as you mentioned cgroup version does make a big difference, that
> > > > with v1, the regression is reduced to 1% ~ 5% on different generations
> > > > of test platforms. Eric mentioned they also got regression report,
> > > > but much smaller one, maybe it's due to the cgroup version?
> > >
> > > This was using the current net-next tree.
> > > Used recipe was something like:
> > >
> > > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
> > > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> > > Create a cgroup by mkdir $MOUNT_POINT/job.
> > > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.
> > >
> > > <Launch tests>
> > >
> > > The regression was smaller than 1%, so considered noise compared to
> > > the benefits of the bug fix.
> >
> > Yes, 1% is just around noise level for a microbenchmark.
> >
> > I went check the original test data of Oliver's report, the tests was
> > run 6 rounds and the performance data is pretty stable (0Day's report
> > will show any std deviation bigger than 2%)
> >
> > The test platform is a 4 sockets 72C/144T machine, and I run the
> > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> > and one Icelake 2 sockets platform, and saw 75% and 53% regresson on
> > them.
> >
> > In the first email, there is a file named 'reproduce', it shows the
> > basic test process:
> >
> > "
> >   use 'performane' cpufre  governor for all CPUs
> >
> >   netserver -4 -D
> >   modprobe sctp
> >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> >   (repeat 36 times in total)
> >   ...
> >
> > "
> >
> > Which starts 36 (25% of nr_cpus) netperf clients. And the clients number
> > also matters, I tried to increase the client number from 36 to 72(50%),
> > and the regression is changed from 69.4% to 73.7%"
> >
> 
> This seems like a lot of opportunities for memcg folks :)
> 
> struct page_counter has poor field placement [1], and no per-cpu cache.
> 
> [1] "atomic_long_t usage" is sharing cache line with read mostly fields.
> 
> (struct mem_cgroup also has poor field placement, mainly because of
> struct page_counter)
> 
>     28.69%  [kernel]       [k] copy_user_enhanced_fast_string
>     16.13%  [kernel]       [k] intel_idle_irq
>      6.46%  [kernel]       [k] page_counter_try_charge
>      6.20%  [kernel]       [k] __sk_mem_reduce_allocated
>      5.68%  [kernel]       [k] try_charge_memcg
>      5.16%  [kernel]       [k] page_counter_cancel

Yes, I also analyzed the perf-profile data, and made some layout changes
which could reduce the regression from 69% to 40%.

7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0
---------------- --------------------------- --------------------------- 
     15722           -69.5%       4792           -40.8%       9300        netperf.Throughput_Mbps
 

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..aa37bd39116c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -179,14 +179,13 @@ struct cgroup_subsys_state {
 	atomic_t online_cnt;
 
 	/* percpu_ref killing and RCU release */
-	struct work_struct destroy_work;
 	struct rcu_work destroy_rwork;
-
+	struct cgroup_subsys_state *parent;
+	struct work_struct destroy_work;
 	/*
 	 * PI: the parent css.	Placed here for cache proximity to following
 	 * fields of the containing structure.
 	 */
-	struct cgroup_subsys_state *parent;
 };
 
 /*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ecead1042b9..963b88ab9930 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -239,9 +239,6 @@ struct mem_cgroup {
 	/* Private memcg ID. Used to ID objects that outlive the cgroup */
 	struct mem_cgroup_id id;
 
-	/* Accounted resources */
-	struct page_counter memory;		/* Both v1 & v2 */
-
 	union {
 		struct page_counter swap;	/* v2 only */
 		struct page_counter memsw;	/* v1 only */
@@ -251,6 +248,9 @@ struct mem_cgroup {
 	struct page_counter kmem;		/* v1 only */
 	struct page_counter tcpmem;		/* v1 only */
 
+	/* Accounted resources */
+	struct page_counter memory;		/* Both v1 & v2 */
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
@@ -313,7 +313,6 @@ struct mem_cgroup {
 	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
 	atomic_long_t		memory_events_local[MEMCG_NR_MEMORY_EVENTS];
 
-	unsigned long		socket_pressure;
 
 	/* Legacy tcp memory accounting */
 	bool			tcpmem_active;
@@ -349,6 +348,7 @@ struct mem_cgroup {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	struct deferred_split deferred_split_queue;
 #endif
+	unsigned long		socket_pressure;
 
 	struct mem_cgroup_per_node *nodeinfo[];
 };

And some of these changes are specific to networking and may not be a
universal win, though I think 'cgroup_subsys_state' could keep the
read-mostly 'parent' away from the written-mostly counters that follow it.

Btw, I tried your debug patch; it failed to compile with 0Day's kbuild
system, but it did compile OK on my local machine.

Thanks,
Feng

> 
> > Thanks,
> > Feng
> >
> > > >
> > > > Thanks,
> > > > Feng

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 12:34                       ` Feng Tang
  2022-06-27 14:07                         ` Eric Dumazet
@ 2022-06-27 14:52                         ` Shakeel Butt
  2022-06-27 14:56                           ` Eric Dumazet
  2022-06-27 15:12                           ` Feng Tang
  1 sibling, 2 replies; 32+ messages in thread
From: Shakeel Butt @ 2022-06-27 14:52 UTC (permalink / raw)
  To: Feng Tang
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 5:34 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote:
> [snip]
> > > > >
> > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > > sure that the netperf test has already run?
> > > >
> > > > memory.kmem.tcp.max_usage_in_bytes:0
> > >
> > > Sorry, I made a mistake that in the original report from Oliver, it
> > > was 'cgroup v2' with a 'debian-11.1' rootfs.
> > >
> > > When you asked about cgroup info, I tried the job on another tbox, and
> > > the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > > parameters and started a new job which somehow run with a 'debian-10.4'
> > > rootfs and acutally run with cgroup v1.
> > >
> > > And as you mentioned cgroup version does make a big difference, that
> > > with v1, the regression is reduced to 1% ~ 5% on different generations
> > > of test platforms. Eric mentioned they also got regression report,
> > > but much smaller one, maybe it's due to the cgroup version?
> >
> > This was using the current net-next tree.
> > Used recipe was something like:
> >
> > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
> > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> > Create a cgroup by mkdir $MOUNT_POINT/job.
> > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.
> >
> > <Launch tests>
> >
> > The regression was smaller than 1%, so considered noise compared to
> > the benefits of the bug fix.
>
> Yes, 1% is just around noise level for a microbenchmark.
>
> I went check the original test data of Oliver's report, the tests was
> run 6 rounds and the performance data is pretty stable (0Day's report
> will show any std deviation bigger than 2%)
>
> The test platform is a 4 sockets 72C/144T machine, and I run the
> same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> and one Icelake 2 sockets platform, and saw 75% and 53% regresson on
> them.
>
> In the first email, there is a file named 'reproduce', it shows the
> basic test process:
>
> "
>   use 'performane' cpufre  governor for all CPUs
>
>   netserver -4 -D
>   modprobe sctp
>   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
>   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
>   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
>   (repeat 36 times in total)
>   ...
>
> "
>
> Which starts 36 (25% of nr_cpus) netperf clients. And the clients number
> also matters, I tried to increase the client number from 36 to 72(50%),
> and the regression is changed from 69.4% to 73.7%
>

Am I understanding correctly that this 69.4% (or 73.7%) regression is
with cgroup v2?

Eric did the experiments on v2 but on real hardware where the
performance impact was negligible.

BTW do you see similar regression for tcp as well or just sctp?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 14:52                         ` Shakeel Butt
@ 2022-06-27 14:56                           ` Eric Dumazet
  2022-06-27 15:12                           ` Feng Tang
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2022-06-27 14:56 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Feng Tang, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
	Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 4:53 PM Shakeel Butt <shakeelb@google.com> wrote:

> Am I understanding correctly that this 69.4% (or 73.7%) regression is
> with cgroup v2?
>
> Eric did the experiments on v2 but on real hardware where the
> performance impact was negligible.
>
> BTW do you see similar regression for tcp as well or just sctp?

TCP_RR with big packets can show a regression as well.

I gave this perf profile:

    28.69%  [kernel]       [k] copy_user_enhanced_fast_string
    16.13%  [kernel]       [k] intel_idle_irq
     6.46%  [kernel]       [k] page_counter_try_charge
     6.20%  [kernel]       [k] __sk_mem_reduce_allocated
     5.68%  [kernel]       [k] try_charge_memcg
     5.16%  [kernel]       [k] page_counter_cancel

And this points to false sharing on (struct page_counter *)->usage.

I guess memcg was getting a free lunch from the per-socket cache that we
now need to remove.
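
For reference, the 'free lunch' is the per-socket sk_forward_alloc
reserve; a simplified sketch of the sock accounting helpers (not the
literal include/net/sock.h source):

static inline void sk_mem_charge(struct sock *sk, int size)
{
	if (!sk_has_account(sk))
		return;
	/* before 4890b686f4 this usually stayed within a large per-socket
	 * reserve, so the global and memcg counters were rarely touched */
	sk->sk_forward_alloc -= size;
}

static inline void sk_mem_uncharge(struct sock *sk, int size)
{
	if (!sk_has_account(sk))
		return;
	sk->sk_forward_alloc += size;
	/* after 4890b686f4 any excess is handed back to the global and
	 * memcg counters much more eagerly, via __sk_mem_reduce_allocated(),
	 * which is why that function is so hot in the profiles above */
}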

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 14:52                         ` Shakeel Butt
  2022-06-27 14:56                           ` Eric Dumazet
@ 2022-06-27 15:12                           ` Feng Tang
  2022-06-27 16:25                             ` Shakeel Butt
  1 sibling, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-27 15:12 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 07:52:55AM -0700, Shakeel Butt wrote:
> On Mon, Jun 27, 2022 at 5:34 AM Feng Tang <feng.tang@intel.com> wrote:
> > Yes, 1% is just around noise level for a microbenchmark.
> >
> > I went check the original test data of Oliver's report, the tests was
> > run 6 rounds and the performance data is pretty stable (0Day's report
> > will show any std deviation bigger than 2%)
> >
> > The test platform is a 4 sockets 72C/144T machine, and I run the
> > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> > and one Icelake 2 sockets platform, and saw 75% and 53% regresson on
> > them.
> >
> > In the first email, there is a file named 'reproduce', it shows the
> > basic test process:
> >
> > "
> >   use 'performane' cpufre  governor for all CPUs
> >
> >   netserver -4 -D
> >   modprobe sctp
> >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> >   (repeat 36 times in total)
> >   ...
> >
> > "
> >
> > Which starts 36 (25% of nr_cpus) netperf clients. And the clients number
> > also matters, I tried to increase the client number from 36 to 72(50%),
> > and the regression is changed from 69.4% to 73.7%
> >
> 
> Am I understanding correctly that this 69.4% (or 73.7%) regression is
> with cgroup v2?

Yes.

> Eric did the experiments on v2 but on real hardware where the
> performance impact was negligible.
> 
> BTW do you see similar regression for tcp as well or just sctp?

Yes, I ran the TCP_SENDFILE case with 'send_size'==10K; it hits a
70%+ regression.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 15:12                           ` Feng Tang
@ 2022-06-27 16:25                             ` Shakeel Butt
  0 siblings, 0 replies; 32+ messages in thread
From: Shakeel Butt @ 2022-06-27 16:25 UTC (permalink / raw)
  To: Feng Tang
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 8:25 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 07:52:55AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 27, 2022 at 5:34 AM Feng Tang <feng.tang@intel.com> wrote:
> > > Yes, 1% is just around noise level for a microbenchmark.
> > >
> > > I went check the original test data of Oliver's report, the tests was
> > > run 6 rounds and the performance data is pretty stable (0Day's report
> > > will show any std deviation bigger than 2%)
> > >
> > > The test platform is a 4 sockets 72C/144T machine, and I run the
> > > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> > > and one Icelake 2 sockets platform, and saw 75% and 53% regresson on
> > > them.
> > >
> > > In the first email, there is a file named 'reproduce', it shows the
> > > basic test process:
> > >
> > > "
> > >   use 'performane' cpufre  governor for all CPUs
> > >
> > >   netserver -4 -D
> > >   modprobe sctp
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> > >   (repeat 36 times in total)
> > >   ...
> > >
> > > "
> > >
> > > Which starts 36 (25% of nr_cpus) netperf clients. And the clients number
> > > also matters, I tried to increase the client number from 36 to 72(50%),
> > > and the regression is changed from 69.4% to 73.7%
> > >
> >
> > Am I understanding correctly that this 69.4% (or 73.7%) regression is
> > with cgroup v2?
>
> Yes.
>
> > Eric did the experiments on v2 but on real hardware where the
> > performance impact was negligible.
> >
> > BTW do you see similar regression for tcp as well or just sctp?
>
> Yes, I run TCP_SENDFILE case with 'send_size'==10K, it hits a
> 70%+ regressioin.
>

Thanks Feng. I think we should start with squeezing whatever we can
from layout changes and then try other approaches like increasing
batch size or something else. I can take a stab at this next week.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 14:48                           ` Feng Tang
@ 2022-06-27 16:25                             ` Eric Dumazet
  2022-06-27 16:48                               ` Shakeel Butt
  2022-06-28  3:49                               ` Feng Tang
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2022-06-27 16:25 UTC (permalink / raw)
  To: Feng Tang
  Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 4:48 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 04:07:55PM +0200, Eric Dumazet wrote:
> > On Mon, Jun 27, 2022 at 2:34 PM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> > > > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang <feng.tang@intel.com> wrote:
> > > [snip]
> > > > > > >
> > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > > > > sure that the netperf test has already run?
> > > > > >
> > > > > > memory.kmem.tcp.max_usage_in_bytes:0
> > > > >
> > > > > Sorry, I made a mistake that in the original report from Oliver, it
> > > > > was 'cgroup v2' with a 'debian-11.1' rootfs.
> > > > >
> > > > > When you asked about cgroup info, I tried the job on another tbox, and
> > > > > the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > > > > parameters and started a new job which somehow run with a 'debian-10.4'
> > > > > rootfs and acutally run with cgroup v1.
> > > > >
> > > > > And as you mentioned cgroup version does make a big difference, that
> > > > > with v1, the regression is reduced to 1% ~ 5% on different generations
> > > > > of test platforms. Eric mentioned they also got regression report,
> > > > > but much smaller one, maybe it's due to the cgroup version?
> > > >
> > > > This was using the current net-next tree.
> > > > Used recipe was something like:
> > > >
> > > > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
> > > > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> > > > Create a cgroup by mkdir $MOUNT_POINT/job.
> > > > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.
> > > >
> > > > <Launch tests>
> > > >
> > > > The regression was smaller than 1%, so considered noise compared to
> > > > the benefits of the bug fix.
> > >
> > > Yes, 1% is just around noise level for a microbenchmark.
> > >
> > > I went check the original test data of Oliver's report, the tests was
> > > run 6 rounds and the performance data is pretty stable (0Day's report
> > > will show any std deviation bigger than 2%)
> > >
> > > The test platform is a 4 sockets 72C/144T machine, and I run the
> > > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> > > and one Icelake 2 sockets platform, and saw 75% and 53% regresson on
> > > them.
> > >
> > > In the first email, there is a file named 'reproduce', it shows the
> > > basic test process:
> > >
> > > "
> > >   use 'performane' cpufre  governor for all CPUs
> > >
> > >   netserver -4 -D
> > >   modprobe sctp
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K  &
> > >   (repeat 36 times in total)
> > >   ...
> > >
> > > "
> > >
> > > Which starts 36 (25% of nr_cpus) netperf clients. And the clients number
> > > also matters, I tried to increase the client number from 36 to 72(50%),
> > > and the regression is changed from 69.4% to 73.7%"
> > >
> >
> > This seems like a lot of opportunities for memcg folks :)
> >
> > struct page_counter has poor field placement [1], and no per-cpu cache.
> >
> > [1] "atomic_long_t usage" is sharing cache line with read mostly fields.
> >
> > (struct mem_cgroup also has poor field placement, mainly because of
> > struct page_counter)
> >
> >     28.69%  [kernel]       [k] copy_user_enhanced_fast_string
> >     16.13%  [kernel]       [k] intel_idle_irq
> >      6.46%  [kernel]       [k] page_counter_try_charge
> >      6.20%  [kernel]       [k] __sk_mem_reduce_allocated
> >      5.68%  [kernel]       [k] try_charge_memcg
> >      5.16%  [kernel]       [k] page_counter_cancel
>
> Yes, I also analyzed the perf-profile data, and made some layout changes
> which could recover the changes from 69% to 40%.
>
> 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0
> ---------------- --------------------------- ---------------------------
>      15722           -69.5%       4792           -40.8%       9300        netperf.Throughput_Mbps
>
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 1bfcfb1af352..aa37bd39116c 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -179,14 +179,13 @@ struct cgroup_subsys_state {
>         atomic_t online_cnt;
>
>         /* percpu_ref killing and RCU release */
> -       struct work_struct destroy_work;
>         struct rcu_work destroy_rwork;
> -
> +       struct cgroup_subsys_state *parent;
> +       struct work_struct destroy_work;
>         /*
>          * PI: the parent css.  Placed here for cache proximity to following
>          * fields of the containing structure.
>          */
> -       struct cgroup_subsys_state *parent;
>  };
>
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 9ecead1042b9..963b88ab9930 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -239,9 +239,6 @@ struct mem_cgroup {
>         /* Private memcg ID. Used to ID objects that outlive the cgroup */
>         struct mem_cgroup_id id;
>
> -       /* Accounted resources */
> -       struct page_counter memory;             /* Both v1 & v2 */
> -
>         union {
>                 struct page_counter swap;       /* v2 only */
>                 struct page_counter memsw;      /* v1 only */
> @@ -251,6 +248,9 @@ struct mem_cgroup {
>         struct page_counter kmem;               /* v1 only */
>         struct page_counter tcpmem;             /* v1 only */
>
> +       /* Accounted resources */
> +       struct page_counter memory;             /* Both v1 & v2 */
> +
>         /* Range enforcement for interrupt charges */
>         struct work_struct high_work;
>
> @@ -313,7 +313,6 @@ struct mem_cgroup {
>         atomic_long_t           memory_events[MEMCG_NR_MEMORY_EVENTS];
>         atomic_long_t           memory_events_local[MEMCG_NR_MEMORY_EVENTS];
>
> -       unsigned long           socket_pressure;
>
>         /* Legacy tcp memory accounting */
>         bool                    tcpmem_active;
> @@ -349,6 +348,7 @@ struct mem_cgroup {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         struct deferred_split deferred_split_queue;
>  #endif
> +       unsigned long           socket_pressure;
>
>         struct mem_cgroup_per_node *nodeinfo[];
>  };
>

I simply did the following and got much better results.

But I am not sure if updates to ->usage are really needed that often...


diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36
100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -3,12 +3,15 @@
 #define _LINUX_PAGE_COUNTER_H

 #include <linux/atomic.h>
+#include <linux/cache.h>
 #include <linux/kernel.h>
 #include <asm/page.h>

 struct page_counter {
-       atomic_long_t usage;
-       unsigned long min;
+       /* contended cache line. */
+       atomic_long_t usage ____cacheline_aligned_in_smp;
+
+       unsigned long min ____cacheline_aligned_in_smp;
        unsigned long low;
        unsigned long high;
        unsigned long max;
@@ -27,12 +30,6 @@ struct page_counter {
        unsigned long watermark;
        unsigned long failcnt;

-       /*
-        * 'parent' is placed here to be far from 'usage' to reduce
-        * cache false sharing, as 'usage' is written mostly while
-        * parent is frequently read for cgroup's hierarchical
-        * counting nature.
-        */
        struct page_counter *parent;
 };



> And some of these are specific for network and may not be a universal
> win, though I think the 'cgroup_subsys_state' could keep the
> read-mostly 'parent' away from following written-mostly counters.
>
> Btw, I tried your debug patch which compiled fail with 0Day's kbuild
> system, but it did compile ok on my local machine.
>
> Thanks,
> Feng
>
> >
> > > Thanks,
> > > Feng
> > >
> > > > >
> > > > > Thanks,
> > > > > Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 16:25                             ` Eric Dumazet
@ 2022-06-27 16:48                               ` Shakeel Butt
  2022-06-27 17:05                                 ` Eric Dumazet
  2022-06-28  1:46                                 ` Roman Gushchin
  2022-06-28  3:49                               ` Feng Tang
  1 sibling, 2 replies; 32+ messages in thread
From: Shakeel Butt @ 2022-06-27 16:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Feng Tang, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
	Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 9:26 AM Eric Dumazet <edumazet@google.com> wrote:
>
[...]
> >
>
> I simply did the following and got much better results.
>
> But I am not sure if updates to ->usage are really needed that often...

I suspect we need to improve the per-cpu memcg stock usage here. Were
the updates mostly from the uncharge path or the charge path, or is that
irrelevant?

I think doing a full drain (i.e. drain_stock()) within __refill_stock()
when the local cache is larger than MEMCG_CHARGE_BATCH is not best.
Rather, we should always keep at least MEMCG_CHARGE_BATCH cached for
such scenarios.
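
A rough sketch of that idea (not a tested patch; the memcg_stock_pcp
field names follow the snippet quoted later in this thread, and locking
plus swap accounting are omitted). Instead of draining the whole per-cpu
stock once it exceeds one batch, only the excess is returned to the page
counters and one batch stays cached for the next charge on this CPU:

static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	if (stock->cached != memcg) {		/* stock caches a single memcg */
		drain_stock(stock);
		css_get(&memcg->css);
		stock->cached = memcg;
	}
	stock->nr_pages += nr_pages;

	if (stock->nr_pages > MEMCG_CHARGE_BATCH) {
		unsigned int excess = stock->nr_pages - MEMCG_CHARGE_BATCH;

		/* give back only what exceeds one batch */
		page_counter_uncharge(&memcg->memory, excess);
		stock->nr_pages = MEMCG_CHARGE_BATCH;
	}
}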

>
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36
> 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,12 +3,15 @@
>  #define _LINUX_PAGE_COUNTER_H
>
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
>
>  struct page_counter {
> -       atomic_long_t usage;
> -       unsigned long min;
> +       /* contended cache line. */
> +       atomic_long_t usage ____cacheline_aligned_in_smp;
> +
> +       unsigned long min ____cacheline_aligned_in_smp;

Do we need to align 'min' too?

>         unsigned long low;
>         unsigned long high;
>         unsigned long max;
> @@ -27,12 +30,6 @@ struct page_counter {
>         unsigned long watermark;
>         unsigned long failcnt;
>
> -       /*
> -        * 'parent' is placed here to be far from 'usage' to reduce
> -        * cache false sharing, as 'usage' is written mostly while
> -        * parent is frequently read for cgroup's hierarchical
> -        * counting nature.
> -        */
>         struct page_counter *parent;
>  };
>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 16:48                               ` Shakeel Butt
@ 2022-06-27 17:05                                 ` Eric Dumazet
  2022-06-28  1:46                                 ` Roman Gushchin
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2022-06-27 17:05 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Feng Tang, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
	Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 6:48 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 9:26 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> [...]
> > >
> >
> > I simply did the following and got much better results.
> >
> > But I am not sure if updates to ->usage are really needed that often...
>
> I suspect we need to improve the per-cpu memcg stock usage here. Were
> the updates mostly from uncharge path or charge path or that's
> irrelevant?

I wonder if the cache is always used...

stock = this_cpu_ptr(&memcg_stock);
if (memcg == stock->cached && stock->nr_pages >= nr_pages) {

Apparently the per-cpu cache is only used for one memcg at a time ?

Not sure how this would scale to hosts with dozens of memcgs.

Maybe we could add some metrics to have an idea of the cache hit/miss ratio :/
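
For context, a simplified sketch of that fast path (not the literal
mm/memcontrol.c source; the local locking details are omitted). Only the
single cached memcg can hit the stock; every other memcg charging on this
CPU falls through to the atomic page_counter path:

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	unsigned long flags;
	bool ret = false;

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return ret;

	local_irq_save(flags);

	stock = this_cpu_ptr(&memcg_stock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;	/* hit: no atomics touched */
		ret = true;
	}

	local_irq_restore(flags);
	return ret;
}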


>
> I think doing full drain (i.e. drain_stock()) within __refill_stock()
> when the local cache is larger than MEMCG_CHARGE_BATCH is not best.
> Rather we should always keep at least MEMCG_CHARGE_BATCH for such
> scenarios.
>
> >
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36
> > 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,12 +3,15 @@
> >  #define _LINUX_PAGE_COUNTER_H
> >
> >  #include <linux/atomic.h>
> > +#include <linux/cache.h>
> >  #include <linux/kernel.h>
> >  #include <asm/page.h>
> >
> >  struct page_counter {
> > -       atomic_long_t usage;
> > -       unsigned long min;
> > +       /* contended cache line. */
> > +       atomic_long_t usage ____cacheline_aligned_in_smp;
> > +
> > +       unsigned long min ____cacheline_aligned_in_smp;
>
> Do we need to align 'min' too?

Probably if there is a hierarchy ...

propagate_protected_usage() seems to have potential high cost.


>
> >         unsigned long low;
> >         unsigned long high;
> >         unsigned long max;
> > @@ -27,12 +30,6 @@ struct page_counter {
> >         unsigned long watermark;
> >         unsigned long failcnt;
> >
> > -       /*
> > -        * 'parent' is placed here to be far from 'usage' to reduce
> > -        * cache false sharing, as 'usage' is written mostly while
> > -        * parent is frequently read for cgroup's hierarchical
> > -        * counting nature.
> > -        */
> >         struct page_counter *parent;
> >  };
> >
> >
> >

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 16:48                               ` Shakeel Butt
  2022-06-27 17:05                                 ` Eric Dumazet
@ 2022-06-28  1:46                                 ` Roman Gushchin
  1 sibling, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2022-06-28  1:46 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Feng Tang, Linux MM, Andrew Morton, Michal Hocko,
	Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
	Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 09:48:01AM -0700, Shakeel Butt wrote:
> On Mon, Jun 27, 2022 at 9:26 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> [...]
> > >
> >
> > I simply did the following and got much better results.
> >
> > But I am not sure if updates to ->usage are really needed that often...
> 
> I suspect we need to improve the per-cpu memcg stock usage here. Were
> the updates mostly from uncharge path or charge path or that's
> irrelevant?
> 
> I think doing full drain (i.e. drain_stock()) within __refill_stock()
> when the local cache is larger than MEMCG_CHARGE_BATCH is not best.
> Rather we should always keep at least MEMCG_CHARGE_BATCH for such
> scenarios.

+1, really good point.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-27 16:25                             ` Eric Dumazet
  2022-06-27 16:48                               ` Shakeel Butt
@ 2022-06-28  3:49                               ` Feng Tang
  2022-07-01 15:47                                 ` Shakeel Butt
  1 sibling, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-06-28  3:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 06:25:59PM +0200, Eric Dumazet wrote:
> On Mon, Jun 27, 2022 at 4:48 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Yes, I also analyzed the perf-profile data, and made some layout changes
> > which could recover the changes from 69% to 40%.
> >
> > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0
> > ---------------- --------------------------- ---------------------------
> >      15722           -69.5%       4792           -40.8%       9300        netperf.Throughput_Mbps
> >
> 
> I simply did the following and got much better results.
> 
> But I am not sure if updates to ->usage are really needed that often...
> 
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36
> 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,12 +3,15 @@
>  #define _LINUX_PAGE_COUNTER_H
> 
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
> 
>  struct page_counter {
> -       atomic_long_t usage;
> -       unsigned long min;
> +       /* contended cache line. */
> +       atomic_long_t usage ____cacheline_aligned_in_smp;
> +
> +       unsigned long min ____cacheline_aligned_in_smp;
>         unsigned long low;
>         unsigned long high;
>         unsigned long max;
> @@ -27,12 +30,6 @@ struct page_counter {
>         unsigned long watermark;
>         unsigned long failcnt;
> 
> -       /*
> -        * 'parent' is placed here to be far from 'usage' to reduce
> -        * cache false sharing, as 'usage' is written mostly while
> -        * parent is frequently read for cgroup's hierarchical
> -        * counting nature.
> -        */
>         struct page_counter *parent;
>  };

I just tested it, and it does perform better (the 4th column is with your
patch); some perf-profile data is also listed.

 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d 
---------------- --------------------------- --------------------------- --------------------------- 
     15722           -69.5%       4792           -40.8%       9300           -27.9%      11341        netperf.Throughput_Mbps

      0.00            +0.3        0.26 ±  5%      +0.5        0.51            +1.3        1.27 ±  2%  pp.self.__sk_mem_raise_allocated
      0.00            +0.3        0.32 ± 15%      +1.7        1.74 ±  2%      +0.4        0.40 ±  2%  pp.self.propagate_protected_usage
      0.00            +0.8        0.82 ±  7%      +0.9        0.90            +0.8        0.84        pp.self.__mod_memcg_state
      0.00            +1.2        1.24 ±  4%      +1.0        1.01            +1.4        1.44        pp.self.try_charge_memcg
      0.00            +2.1        2.06            +2.1        2.13            +2.1        2.11        pp.self.page_counter_uncharge
      0.00            +2.1        2.14 ±  4%      +2.7        2.71            +2.6        2.60 ±  2%  pp.self.page_counter_try_charge
      1.12 ±  4%      +3.1        4.24            +1.1        2.22            +1.4        2.51        pp.self.native_queued_spin_lock_slowpath
      0.28 ±  9%      +3.8        4.06 ±  4%      +0.2        0.48            +0.4        0.68        pp.self.sctp_eat_data
      0.00            +8.2        8.23            +0.8        0.83            +1.3        1.26        pp.self.__sk_mem_reduce_allocated

And the size of 'mem_cgroup' is increased from 4224 bytes to 4608.

Another note: the perf hotspots are slightly different between the
tcp and sctp test cases.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-06-28  3:49                               ` Feng Tang
@ 2022-07-01 15:47                                 ` Shakeel Butt
  2022-07-03 10:43                                   ` Feng Tang
  0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2022-07-01 15:47 UTC (permalink / raw)
  To: Feng Tang
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 06:25:59PM +0200, Eric Dumazet wrote:
> > On Mon, Jun 27, 2022 at 4:48 PM Feng Tang <feng.tang@intel.com> wrote:
> > >
> > > Yes, I also analyzed the perf-profile data, and made some layout changes
> > > which could recover the changes from 69% to 40%.
> > >
> > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0
> > > ---------------- --------------------------- ---------------------------
> > >      15722           -69.5%       4792           -40.8%       9300        netperf.Throughput_Mbps
> > >
> >
> > I simply did the following and got much better results.
> >
> > But I am not sure if updates to ->usage are really needed that often...
> >
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36
> > 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,12 +3,15 @@
> >  #define _LINUX_PAGE_COUNTER_H
> >
> >  #include <linux/atomic.h>
> > +#include <linux/cache.h>
> >  #include <linux/kernel.h>
> >  #include <asm/page.h>
> >
> >  struct page_counter {
> > -       atomic_long_t usage;
> > -       unsigned long min;
> > +       /* contended cache line. */
> > +       atomic_long_t usage ____cacheline_aligned_in_smp;
> > +
> > +       unsigned long min ____cacheline_aligned_in_smp;
> >         unsigned long low;
> >         unsigned long high;
> >         unsigned long max;
> > @@ -27,12 +30,6 @@ struct page_counter {
> >         unsigned long watermark;
> >         unsigned long failcnt;
> >
> > -       /*
> > -        * 'parent' is placed here to be far from 'usage' to reduce
> > -        * cache false sharing, as 'usage' is written mostly while
> > -        * parent is frequently read for cgroup's hierarchical
> > -        * counting nature.
> > -        */
> >         struct page_counter *parent;
> >  };
>
> I just tested it, it does perform better (the 4th is with your patch),
> some perf-profile data is also listed.
>
>  7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> ---------------- --------------------------- --------------------------- ---------------------------
>      15722           -69.5%       4792           -40.8%       9300           -27.9%      11341        netperf.Throughput_Mbps
>
>       0.00            +0.3        0.26 ±  5%      +0.5        0.51            +1.3        1.27 ±  2%pp.self.__sk_mem_raise_allocated
>       0.00            +0.3        0.32 ± 15%      +1.7        1.74 ±  2%      +0.4        0.40 ±  2%  pp.self.propagate_protected_usage
>       0.00            +0.8        0.82 ±  7%      +0.9        0.90            +0.8        0.84        pp.self.__mod_memcg_state
>       0.00            +1.2        1.24 ±  4%      +1.0        1.01            +1.4        1.44        pp.self.try_charge_memcg
>       0.00            +2.1        2.06            +2.1        2.13            +2.1        2.11        pp.self.page_counter_uncharge
>       0.00            +2.1        2.14 ±  4%      +2.7        2.71            +2.6        2.60 ±  2%  pp.self.page_counter_try_charge
>       1.12 ±  4%      +3.1        4.24            +1.1        2.22            +1.4        2.51        pp.self.native_queued_spin_lock_slowpath
>       0.28 ±  9%      +3.8        4.06 ±  4%      +0.2        0.48            +0.4        0.68        pp.self.sctp_eat_data
>       0.00            +8.2        8.23            +0.8        0.83            +1.3        1.26        pp.self.__sk_mem_reduce_allocated
>
> And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.

Hi Feng, can you please try two more configurations? Take Eric's patch
adding ____cacheline_aligned_in_smp to page_counter and, on top of it,
first increase MEMCG_CHARGE_BATCH to 64 and then increase it to 128.
Basically batch-size increases combined with Eric's patch.
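
Concretely, the two extra runs would be something like this on top of
Eric's patch (illustrative only; the batch is assumed to still be
defined as 32U in include/linux/memcontrol.h):

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
-#define MEMCG_CHARGE_BATCH 32U
+#define MEMCG_CHARGE_BATCH 64U		/* second configuration: 128U */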

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-07-01 15:47                                 ` Shakeel Butt
@ 2022-07-03 10:43                                   ` Feng Tang
  2022-07-03 22:55                                     ` Roman Gushchin
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-07-03 10:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Eric Dumazet, Linux MM, Andrew Morton, Roman Gushchin,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

Hi Shakeel,

On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote:
> On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
> > I just tested it, it does perform better (the 4th is with your patch),
> > some perf-profile data is also listed.
> >
> >  7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> > ---------------- --------------------------- --------------------------- ---------------------------
> >      15722           -69.5%       4792           -40.8%       9300           -27.9%      11341        netperf.Throughput_Mbps
> >
> >       0.00            +0.3        0.26 ±  5%      +0.5        0.51            +1.3        1.27 ±  2%pp.self.__sk_mem_raise_allocated
> >       0.00            +0.3        0.32 ± 15%      +1.7        1.74 ±  2%      +0.4        0.40 ±  2%  pp.self.propagate_protected_usage
> >       0.00            +0.8        0.82 ±  7%      +0.9        0.90            +0.8        0.84        pp.self.__mod_memcg_state
> >       0.00            +1.2        1.24 ±  4%      +1.0        1.01            +1.4        1.44        pp.self.try_charge_memcg
> >       0.00            +2.1        2.06            +2.1        2.13            +2.1        2.11        pp.self.page_counter_uncharge
> >       0.00            +2.1        2.14 ±  4%      +2.7        2.71            +2.6        2.60 ±  2%  pp.self.page_counter_try_charge
> >       1.12 ±  4%      +3.1        4.24            +1.1        2.22            +1.4        2.51        pp.self.native_queued_spin_lock_slowpath
> >       0.28 ±  9%      +3.8        4.06 ±  4%      +0.2        0.48            +0.4        0.68        pp.self.sctp_eat_data
> >       0.00            +8.2        8.23            +0.8        0.83            +1.3        1.26        pp.self.__sk_mem_reduce_allocated
> >
> > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.
> 
> Hi Feng, can you please try two more configurations? Take Eric's patch
> of adding ____cacheline_aligned_in_smp in page_counter and for first
> increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128.
> Basically batch increases combined with Eric's patch.

With the batch increased to 128, the regression is reduced to -12.4%.

Some more details with perf-profile data below:

7c80b038d23e1f4c 4890b686f4088c90432149bd6de  Eric's patch                 Eric's patch + batch-64   Eric's patch + batch-128
---------------- --------------------------- --------------------------- --------------------------- --------------------------- 
     15722           -69.5%       4792           -27.9%      11341           -14.0%      13521           -12.4%      13772        netperf.Throughput_Mbps
 
      0.05            +0.2        0.27 ± 18%      +0.0        0.08 ±  6%      -0.1        0.00            -0.0        0.03 ±100%  pp.self.timekeeping_max_deferment
      0.00            +0.3        0.26 ±  5%      +1.3        1.27 ±  2%      +1.8        1.82 ± 10%      +2.0        1.96 ±  9%  pp.self.__sk_mem_raise_allocated
      0.00            +0.3        0.32 ± 15%      +0.4        0.40 ±  2%      +0.1        0.10 ±  5%      +0.0        0.00        pp.self.propagate_protected_usage
      0.00            +0.8        0.82 ±  7%      +0.8        0.84            +0.5        0.48            +0.4        0.36 ±  2%  pp.self.__mod_memcg_state
      0.00            +1.2        1.24 ±  4%      +1.4        1.44            +0.4        0.40 ±  3%      +0.2        0.24 ±  6%  pp.self.try_charge_memcg
      0.00            +2.1        2.06            +2.1        2.11            +0.5        0.50            +0.2        0.18 ±  8%  pp.self.page_counter_uncharge
      0.00            +2.1        2.14 ±  4%      +2.6        2.60 ±  2%      +0.6        0.58            +0.2        0.20        pp.self.page_counter_try_charge
      1.12 ±  4%      +3.1        4.24            +1.4        2.51            +1.0        2.10 ±  2%      +1.0        2.10 ±  9%  pp.self.native_queued_spin_lock_slowpath
      0.28 ±  9%      +3.8        4.06 ±  4%      +0.4        0.68            +0.6        0.90 ±  9%      +0.7        1.00 ± 11%  pp.self.sctp_eat_data
      0.00            +8.2        8.23            +1.3        1.26            +1.7        1.72 ±  6%      +2.0        1.95 ± 10%  pp.self.__sk_mem_reduce_allocated
 
 Thanks,
 Feng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-07-03 10:43                                   ` Feng Tang
@ 2022-07-03 22:55                                     ` Roman Gushchin
  2022-07-05  5:03                                       ` Feng Tang
  0 siblings, 1 reply; 32+ messages in thread
From: Roman Gushchin @ 2022-07-03 22:55 UTC (permalink / raw)
  To: Feng Tang
  Cc: Shakeel Butt, Eric Dumazet, Linux MM, Andrew Morton,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Sun, Jul 03, 2022 at 06:43:53PM +0800, Feng Tang wrote:
> Hi Shakeel,
> 
> On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
> > > I just tested it, it does perform better (the 4th is with your patch),
> > > some perf-profile data is also listed.
> > >
> > >  7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> > > ---------------- --------------------------- --------------------------- ---------------------------
> > >      15722           -69.5%       4792           -40.8%       9300           -27.9%      11341        netperf.Throughput_Mbps
> > >
> > >       0.00            +0.3        0.26 ±  5%      +0.5        0.51            +1.3        1.27 ±  2%  pp.self.__sk_mem_raise_allocated
> > >       0.00            +0.3        0.32 ± 15%      +1.7        1.74 ±  2%      +0.4        0.40 ±  2%  pp.self.propagate_protected_usage
> > >       0.00            +0.8        0.82 ±  7%      +0.9        0.90            +0.8        0.84        pp.self.__mod_memcg_state
> > >       0.00            +1.2        1.24 ±  4%      +1.0        1.01            +1.4        1.44        pp.self.try_charge_memcg
> > >       0.00            +2.1        2.06            +2.1        2.13            +2.1        2.11        pp.self.page_counter_uncharge
> > >       0.00            +2.1        2.14 ±  4%      +2.7        2.71            +2.6        2.60 ±  2%  pp.self.page_counter_try_charge
> > >       1.12 ±  4%      +3.1        4.24            +1.1        2.22            +1.4        2.51        pp.self.native_queued_spin_lock_slowpath
> > >       0.28 ±  9%      +3.8        4.06 ±  4%      +0.2        0.48            +0.4        0.68        pp.self.sctp_eat_data
> > >       0.00            +8.2        8.23            +0.8        0.83            +1.3        1.26        pp.self.__sk_mem_reduce_allocated
> > >
> > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.
> > 
> > Hi Feng, can you please try two more configurations? Take Eric's patch
> > of adding ____cacheline_aligned_in_smp in page_counter and for first
> > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128.
> > Basically batch increases combined with Eric's patch.
> 
> With the batch increased to 128, the regression is reduced to -12.4%.

If we're going to bump it, I wonder if we should scale it dynamically depending
on the size of the memory cgroup?
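
Purely as a hypothetical sketch of what that could look like -- no such
helper exists, and the thresholds below are made up:

/* Assumes <linux/memcontrol.h>, <linux/page_counter.h>, <linux/sizes.h>.
 * Tightly limited groups keep the small batch for accounting accuracy;
 * large or unlimited groups batch more aggressively so the hot paths
 * stay off the shared page counters.
 */
static inline unsigned int memcg_charge_batch(struct mem_cgroup *memcg)
{
	unsigned long max = READ_ONCE(memcg->memory.max);	/* in pages */

	if (max == PAGE_COUNTER_MAX || max > SZ_1G / PAGE_SIZE)
		return 128;
	if (max > SZ_64M / PAGE_SIZE)
		return 64;
	return MEMCG_CHARGE_BATCH;	/* today's fixed batch */
}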

Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-07-03 22:55                                     ` Roman Gushchin
@ 2022-07-05  5:03                                       ` Feng Tang
  2022-08-16  5:52                                         ` Oliver Sang
  0 siblings, 1 reply; 32+ messages in thread
From: Feng Tang @ 2022-07-05  5:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Eric Dumazet, Linux MM, Andrew Morton,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, kernel test robot,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

On Sun, Jul 03, 2022 at 03:55:31PM -0700, Roman Gushchin wrote:
> On Sun, Jul 03, 2022 at 06:43:53PM +0800, Feng Tang wrote:
> > Hi Shakeel,
> > 
> > On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
> > > > I just tested it, it does perform better (the 4th is with your patch),
> > > > some perf-profile data is also listed.
> > > >
> > > >  7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> > > > ---------------- --------------------------- --------------------------- ---------------------------
> > > >      15722           -69.5%       4792           -40.8%       9300           -27.9%      11341        netperf.Throughput_Mbps
> > > >
> > > >       0.00            +0.3        0.26 ±  5%      +0.5        0.51            +1.3        1.27 ±  2%  pp.self.__sk_mem_raise_allocated
> > > >       0.00            +0.3        0.32 ± 15%      +1.7        1.74 ±  2%      +0.4        0.40 ±  2%  pp.self.propagate_protected_usage
> > > >       0.00            +0.8        0.82 ±  7%      +0.9        0.90            +0.8        0.84        pp.self.__mod_memcg_state
> > > >       0.00            +1.2        1.24 ±  4%      +1.0        1.01            +1.4        1.44        pp.self.try_charge_memcg
> > > >       0.00            +2.1        2.06            +2.1        2.13            +2.1        2.11        pp.self.page_counter_uncharge
> > > >       0.00            +2.1        2.14 ±  4%      +2.7        2.71            +2.6        2.60 ±  2%  pp.self.page_counter_try_charge
> > > >       1.12 ±  4%      +3.1        4.24            +1.1        2.22            +1.4        2.51        pp.self.native_queued_spin_lock_slowpath
> > > >       0.28 ±  9%      +3.8        4.06 ±  4%      +0.2        0.48            +0.4        0.68        pp.self.sctp_eat_data
> > > >       0.00            +8.2        8.23            +0.8        0.83            +1.3        1.26        pp.self.__sk_mem_reduce_allocated
> > > >
> > > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.
> > > 
> > > Hi Feng, can you please try two more configurations? Take Eric's patch
> > > of adding ____cacheline_aligned_in_smp in page_counter and for first
> > > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128.
> > > Basically batch increases combined with Eric's patch.
> > 
> > With the batch increased to 128, the regression is reduced to -12.4%.
> 
> If we're going to bump it, I wonder if we should scale it dynamically depending
> on the size of the memory cgroup?
 
I think it makes sense, or we could also make it a configurable parameter?
From the 0Day test reports, this charging/counting plays a critical role in
performance (effects of up to 60% are easy to see). If a user only wants
memcg for isolating things, or doesn't care about charging/stats, these
look like extra taxes.
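
Just to illustrate the "configurable" idea, a hypothetical sketch -- no
such knob exists today, and the parameter name and range are made up:

/* Assumes <linux/init.h> and <linux/kernel.h>.  Allow overriding the
 * charge batch on the kernel command line, e.g. memcg_charge_batch=64,
 * clamped to keep the per-CPU stocks within reason.
 */
static unsigned int memcg_batch __read_mostly = MEMCG_CHARGE_BATCH;

static int __init setup_memcg_charge_batch(char *s)
{
	unsigned int val;

	if (!kstrtouint(s, 0, &val) && val >= 1 && val <= 512)
		memcg_batch = val;
	return 1;
}
__setup("memcg_charge_batch=", setup_memcg_charge_batch);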

For bumping to 64 or 128, a universal improvement is expected; the only
concern is accounting accuracy.

Thanks,
Feng

> Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-07-05  5:03                                       ` Feng Tang
@ 2022-08-16  5:52                                         ` Oliver Sang
  2022-08-16 15:55                                           ` Shakeel Butt
  0 siblings, 1 reply; 32+ messages in thread
From: Oliver Sang @ 2022-08-16  5:52 UTC (permalink / raw)
  To: Feng Tang
  Cc: Roman Gushchin, Shakeel Butt, Eric Dumazet, Linux MM,
	Andrew Morton, Michal Hocko, Johannes Weiner, Muchun Song,
	Jakub Kicinski, Xin Long, Marcelo Ricardo Leitner,
	Soheil Hassas Yeganeh, LKML, network dev, linux-s390,
	MPTCP Upstream, linux-sctp @ vger . kernel . org, lkp,
	kbuild test robot, Huang Ying, Xing Zhengjun, Yin Fengwei,
	Ying Xu

Hi all,

We noticed this commit has now been merged into mainline, and in our tests
there is still a similar regression. [1]

We're not sure if there is a plan to merge some of the solutions that have
been discussed in this thread; we'd be very glad to test patches if there
is a request.

Thanks a lot!

[1]
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/ip/runtime/nr_threads/cluster/send_size/test/cpufreq_governor/ucode:
  lkp-icl-2sp4/netperf/debian-11.1-x86_64-20220510.cgz/x86_64-rhel-8.3/gcc-11/ipv4/300s/50%/cs-localhost/10K/SCTP_STREAM_MANY/performance/0xd000363

7c80b038d23e1f4c 4890b686f4088c90432149bd6de
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
      9078           -55.9%       4006        netperf.Throughput_Mbps
    581006           -55.9%     256385        netperf.Throughput_total_Mbps
     36715           -54.6%      16674 ±  4%  netperf.time.involuntary_context_switches
      1885           -50.2%     938.33 ±  3%  netperf.time.percent_of_cpu_this_job_got
      5533           -49.9%       2771 ±  2%  netperf.time.system_time
    152.13           -59.5%      61.61 ±  2%  netperf.time.user_time
    418171 ±  5%     +89.4%     791954 ± 17%  netperf.time.voluntary_context_switches
 2.128e+09           -55.9%  9.389e+08        netperf.workload
     30217           +17.8%      35608        uptime.idle
 2.689e+10           +20.3%  3.234e+10        cpuidle..time
 6.366e+08           -48.1%  3.305e+08        cpuidle..usage
     70.26           +13.5       83.78        mpstat.cpu.all.idle%
      4.46            -1.5        2.92 ±  3%  mpstat.cpu.all.soft%
     23.71           -11.6       12.16 ±  3%  mpstat.cpu.all.sys%
      0.89            -0.5        0.38        mpstat.cpu.all.usr%
 1.392e+09           -57.5%   5.91e+08 ± 12%  numa-numastat.node0.local_node
 1.389e+09           -57.5%  5.906e+08 ± 12%  numa-numastat.node0.numa_hit
 1.369e+09           -54.5%  6.226e+08 ± 12%  numa-numastat.node1.local_node
 1.366e+09           -54.4%  6.222e+08 ± 12%  numa-numastat.node1.numa_hit
     36715           -54.6%      16674 ±  4%  time.involuntary_context_switches
      1885           -50.2%     938.33 ±  3%  time.percent_of_cpu_this_job_got
      5533           -49.9%       2771 ±  2%  time.system_time
    152.13           -59.5%      61.61 ±  2%  time.user_time
    418171 ±  5%     +89.4%     791954 ± 17%  time.voluntary_context_switches


On Tue, Jul 05, 2022 at 01:03:26PM +0800, Feng Tang wrote:
> On Sun, Jul 03, 2022 at 03:55:31PM -0700, Roman Gushchin wrote:
> > On Sun, Jul 03, 2022 at 06:43:53PM +0800, Feng Tang wrote:
> > > Hi Shakeel,
> > > 
> > > On Fri, Jul 01, 2022 at 08:47:29AM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 27, 2022 at 8:49 PM Feng Tang <feng.tang@intel.com> wrote:
> > > > > I just tested it, it does perform better (the 4th is with your patch),
> > > > > some perf-profile data is also listed.
> > > > >
> > > > >  7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0 e719635902654380b23ffce908d
> > > > > ---------------- --------------------------- --------------------------- ---------------------------
> > > > >      15722           -69.5%       4792           -40.8%       9300           -27.9%      11341        netperf.Throughput_Mbps
> > > > >
> > > > >       0.00            +0.3        0.26 ±  5%      +0.5        0.51            +1.3        1.27 ±  2%  pp.self.__sk_mem_raise_allocated
> > > > >       0.00            +0.3        0.32 ± 15%      +1.7        1.74 ±  2%      +0.4        0.40 ±  2%  pp.self.propagate_protected_usage
> > > > >       0.00            +0.8        0.82 ±  7%      +0.9        0.90            +0.8        0.84        pp.self.__mod_memcg_state
> > > > >       0.00            +1.2        1.24 ±  4%      +1.0        1.01            +1.4        1.44        pp.self.try_charge_memcg
> > > > >       0.00            +2.1        2.06            +2.1        2.13            +2.1        2.11        pp.self.page_counter_uncharge
> > > > >       0.00            +2.1        2.14 ±  4%      +2.7        2.71            +2.6        2.60 ±  2%  pp.self.page_counter_try_charge
> > > > >       1.12 ±  4%      +3.1        4.24            +1.1        2.22            +1.4        2.51        pp.self.native_queued_spin_lock_slowpath
> > > > >       0.28 ±  9%      +3.8        4.06 ±  4%      +0.2        0.48            +0.4        0.68        pp.self.sctp_eat_data
> > > > >       0.00            +8.2        8.23            +0.8        0.83            +1.3        1.26        pp.self.__sk_mem_reduce_allocated
> > > > >
> > > > > And the size of 'mem_cgroup' is increased from 4224 Bytes to 4608.
> > > > 
> > > > Hi Feng, can you please try two more configurations? Take Eric's patch
> > > > of adding ____cacheline_aligned_in_smp in page_counter and for first
> > > > increase MEMCG_CHARGE_BATCH to 64 and for second increase it to 128.
> > > > Basically batch increases combined with Eric's patch.
> > > 
> > > With the batch increased to 128, the regression is reduced to -12.4%.
> > 
> > If we're going to bump it, I wonder if we should scale it dynamically depending
> > on the size of the memory cgroup?
>  
> I think it makes sense, or we could also make it a configurable parameter?
> From the 0Day test reports, this charging/counting plays a critical role in
> performance (effects of up to 60% are easy to see). If a user only wants
> memcg for isolating things, or doesn't care about charging/stats, these
> look like extra taxes.
> 
> For bumping to 64 or 128, a universal improvement is expected; the only
> concern is accounting accuracy.
> 
> Thanks,
> Feng
> 
> > Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
  2022-08-16  5:52                                         ` Oliver Sang
@ 2022-08-16 15:55                                           ` Shakeel Butt
  0 siblings, 0 replies; 32+ messages in thread
From: Shakeel Butt @ 2022-08-16 15:55 UTC (permalink / raw)
  To: Oliver Sang
  Cc: Feng Tang, Roman Gushchin, Eric Dumazet, Linux MM, Andrew Morton,
	Michal Hocko, Johannes Weiner, Muchun Song, Jakub Kicinski,
	Xin Long, Marcelo Ricardo Leitner, Soheil Hassas Yeganeh, LKML,
	network dev, linux-s390, MPTCP Upstream,
	linux-sctp @ vger . kernel . org, lkp, kbuild test robot,
	Huang Ying, Xing Zhengjun, Yin Fengwei, Ying Xu

On Mon, Aug 15, 2022 at 10:53 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> Hi all,
>
> We noticed this commit has now been merged into mainline, and in our tests
> there is still a similar regression. [1]
>
> We're not sure if there is a plan to merge some of the solutions that have
> been discussed in this thread; we'd be very glad to test patches if there
> is a request.
>
> Thanks a lot!

Hi Oliver, sorry for the delay. I will send out the patches in a day or two.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-08-16 15:59 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220619150456.GB34471@xsang-OptiPlex-9020>
     [not found] ` <20220622172857.37db0d29@kernel.org>
     [not found]   ` <CADvbK_csvmkKe46hT9792=+Qcjor2EvkkAnr--CJK3NGX-N9BQ@mail.gmail.com>
2022-06-23 22:50     ` [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression Xin Long
2022-06-24  1:57       ` Jakub Kicinski
2022-06-24  4:13         ` Eric Dumazet
2022-06-24  4:22           ` Eric Dumazet
2022-06-24  5:13           ` Feng Tang
2022-06-24  5:45             ` Eric Dumazet
2022-06-24  6:00               ` Feng Tang
2022-06-24  6:07                 ` Eric Dumazet
2022-06-24  6:34           ` Shakeel Butt
2022-06-24  7:06             ` Feng Tang
2022-06-24 14:43               ` Shakeel Butt
2022-06-25  2:36                 ` Feng Tang
2022-06-27  2:38                   ` Feng Tang
2022-06-27  8:46                     ` Eric Dumazet
2022-06-27 12:34                       ` Feng Tang
2022-06-27 14:07                         ` Eric Dumazet
2022-06-27 14:48                           ` Feng Tang
2022-06-27 16:25                             ` Eric Dumazet
2022-06-27 16:48                               ` Shakeel Butt
2022-06-27 17:05                                 ` Eric Dumazet
2022-06-28  1:46                                 ` Roman Gushchin
2022-06-28  3:49                               ` Feng Tang
2022-07-01 15:47                                 ` Shakeel Butt
2022-07-03 10:43                                   ` Feng Tang
2022-07-03 22:55                                     ` Roman Gushchin
2022-07-05  5:03                                       ` Feng Tang
2022-08-16  5:52                                         ` Oliver Sang
2022-08-16 15:55                                           ` Shakeel Butt
2022-06-27 14:52                         ` Shakeel Butt
2022-06-27 14:56                           ` Eric Dumazet
2022-06-27 15:12                           ` Feng Tang
2022-06-27 16:25                             ` Shakeel Butt
