netdev.vger.kernel.org archive mirror
* [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
@ 2020-06-16 18:03 Cong Wang
  2020-06-18  1:44 ` Zefan Li
  0 siblings, 1 reply; 41+ messages in thread
From: Cong Wang @ 2020-06-16 18:03 UTC (permalink / raw)
  To: netdev
  Cc: Cong Wang, Cameron Berkenpas, Peter Geis, Lu Fengqi,
	Daniël Sonck, Daniel Borkmann, Zefan Li, Tejun Heo

When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/linux/cgroup.h |  2 ++
 kernel/cgroup/cgroup.c | 26 ++++++++++++++------------
 net/core/sock.c        |  2 +-
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..818dc7b3ed6c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
 
 void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
@@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 #else	/* CONFIG_CGROUP_DATA */
 
 static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
 static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif	/* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..6377045b7096 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6442,18 +6442,6 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	if (cgroup_sk_alloc_disabled)
 		return;
 
-	/* Socket clone path */
-	if (skcd->val) {
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
-		return;
-	}
-
 	/* Don't associate the sock with unrelated interrupted task's cgroup. */
 	if (in_interrupt())
 		return;
@@ -6475,6 +6463,20 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	rcu_read_unlock();
 }
 
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+	/* Socket clone path */
+	if (skcd->val) {
+		/*
+		 * We might be cloning a socket which is left in an empty
+		 * cgroup and the cgroup might have already been rmdir'd.
+		 * Don't use cgroup_get_live().
+		 */
+		cgroup_get(sock_cgroup_ptr(skcd));
+		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	}
+}
+
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c4acf1f0220..b62f06fa5e37 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1925,7 +1925,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		/* sk->sk_memcg will be populated at accept() time */
 		newsk->sk_memcg = NULL;
 
-		cgroup_sk_alloc(&newsk->sk_cgrp_data);
+		cgroup_sk_clone(&newsk->sk_cgrp_data);
 
 		rcu_read_lock();
 		filter = rcu_dereference(sk->sk_filter);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-16 18:03 [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock() Cong Wang
@ 2020-06-18  1:44 ` Zefan Li
  2020-06-18 19:19   ` Cong Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Zefan Li @ 2020-06-18  1:44 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo, Roman Gushchin

Cc: Roman Gushchin <guro@fb.com>

Thanks for fixing this.

On 2020/6/17 2:03, Cong Wang wrote:
> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> copied, so the cgroup refcnt must be taken too. And, unlike the
> sk_alloc() path, sock_update_netprioidx() is not called here.
> Therefore, it is safe and necessary to grab the cgroup refcnt
> even when cgroup_sk_alloc is disabled.
> 
> sk_clone_lock() is in BH context anyway, the in_interrupt()
> would terminate this function if called there. And for sk_alloc()
> skcd->val is always zero. So it's safe to factor out the code
> to make it more readable.
> 
> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")

but I don't think the bug was introduced by this commit, because there
are already calls to cgroup_sk_alloc_disable() in write_priomap() and
write_classid(), which can be triggered by writing to ifpriomap or
classid in cgroupfs. This commit just made it much easier to happen
with systemd involved.

I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
which added cgroup_bpf_get() in cgroup_sk_alloc().

> Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
> Reported-by: Peter Geis <pgwipeout@gmail.com>
> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
> Reported-by: Daniël Sonck <dsonck92@gmail.com>
> Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Zefan Li <lizefan@huawei.com>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-18  1:44 ` Zefan Li
@ 2020-06-18 19:19   ` Cong Wang
  2020-06-18 19:36     ` Roman Gushchin
  0 siblings, 1 reply; 41+ messages in thread
From: Cong Wang @ 2020-06-18 19:19 UTC (permalink / raw)
  To: Zefan Li
  Cc: Linux Kernel Network Developers, Cameron Berkenpas, Peter Geis,
	Lu Fengqi, Daniël Sonck, Daniel Borkmann, Tejun Heo,
	Roman Gushchin

On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>
> Cc: Roman Gushchin <guro@fb.com>
>
> Thanks for fixing this.
>
> On 2020/6/17 2:03, Cong Wang wrote:
> > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > copied, so the cgroup refcnt must be taken too. And, unlike the
> > sk_alloc() path, sock_update_netprioidx() is not called here.
> > Therefore, it is safe and necessary to grab the cgroup refcnt
> > even when cgroup_sk_alloc is disabled.
> >
> > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > would terminate this function if called there. And for sk_alloc()
> > skcd->val is always zero. So it's safe to factor out the code
> > to make it more readable.
> >
> > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>
> but I don't think the bug was introduced by this commit, because there
> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> write_classid(), which can be triggered by writing to ifpriomap or
> classid in cgroupfs. This commit just made it much easier to happen
> with systemd involved.
>
> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> which added cgroup_bpf_get() in cgroup_sk_alloc().

Good point.

I took a deeper look; it looks like commit d979a39d7242e06
is the one to blame, because it is the first commit that began to
hold cgroup refcnt in cgroup_sk_alloc().

The commit you mentioned above merely adds a refcnt for
cgroup bpf on top of the cgroup refcnt.

Thanks.


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-18 19:19   ` Cong Wang
@ 2020-06-18 19:36     ` Roman Gushchin
  2020-06-18 21:09       ` Cong Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-06-18 19:36 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >
> > Cc: Roman Gushchin <guro@fb.com>
> >
> > Thanks for fixing this.
> >
> > On 2020/6/17 2:03, Cong Wang wrote:
> > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > even when cgroup_sk_alloc is disabled.
> > >
> > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > would terminate this function if called there. And for sk_alloc()
> > > skcd->val is always zero. So it's safe to factor out the code
> > > to make it more readable.
> > >
> > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >
> > but I don't think the bug was introduced by this commit, because there
> > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > write_classid(), which can be triggered by writing to ifpriomap or
> > classid in cgroupfs. This commit just made it much easier to happen
> > with systemd involved.
> >
> > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > which added cgroup_bpf_get() in cgroup_sk_alloc().
> 
> Good point.
> 
> I took a deeper look; it looks like commit d979a39d7242e06
> is the one to blame, because it is the first commit that began to
> hold cgroup refcnt in cgroup_sk_alloc().

I agree, it seems that the issue is not related to bpf and probably
can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
seems closer to the origin.

Btw, based on the number of reported-by tags it seems that there was
a real issue which the patch is fixing. Maybe you'll add a couple of
words about how it reveals itself in real life?

Thanks!


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-18 19:36     ` Roman Gushchin
@ 2020-06-18 21:09       ` Cong Wang
  2020-06-18 21:26         ` Roman Gushchin
  2020-06-19  6:40         ` Zefan Li
  0 siblings, 2 replies; 41+ messages in thread
From: Cong Wang @ 2020-06-18 21:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > >
> > > Cc: Roman Gushchin <guro@fb.com>
> > >
> > > Thanks for fixing this.
> > >
> > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > even when cgroup_sk_alloc is disabled.
> > > >
> > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > would terminate this function if called there. And for sk_alloc()
> > > > skcd->val is always zero. So it's safe to factor out the code
> > > > to make it more readable.
> > > >
> > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > >
> > > but I don't think the bug was introduced by this commit, because there
> > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > write_classid(), which can be triggered by writing to ifpriomap or
> > > classid in cgroupfs. This commit just made it much easier to happen
> > > with systemd involved.
> > >
> > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> >
> > Good point.
> >
> > I took a deeper look; it looks like commit d979a39d7242e06
> > is the one to blame, because it is the first commit that began to
> > hold cgroup refcnt in cgroup_sk_alloc().
>
> I agree, it seems that the issue is not related to bpf and probably
> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> seems closer to the origin.

Yeah, I will update the Fixes tag and send V2.

>
> Btw, based on the number of reported-by tags it seems that there was
> a real issue which the patch is fixing. Maybe you'll add a couple of
> words about how it reveals itself in real life?

I still have no idea how exactly this is triggered. According to the
people who reported this bug, they just need to wait for some hours
to trigger. So I am not sure what to add here, just the stack trace?

Thanks.


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-18 21:09       ` Cong Wang
@ 2020-06-18 21:26         ` Roman Gushchin
  2020-06-18 22:45           ` Peter Geis
  2020-06-19  6:40         ` Zefan Li
  1 sibling, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-06-18 21:26 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Thu, Jun 18, 2020 at 02:09:43PM -0700, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > > >
> > > > Cc: Roman Gushchin <guro@fb.com>
> > > >
> > > > Thanks for fixing this.
> > > >
> > > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > > even when cgroup_sk_alloc is disabled.
> > > > >
> > > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > > would terminate this function if called there. And for sk_alloc()
> > > > > skcd->val is always zero. So it's safe to factor out the code
> > > > > to make it more readable.
> > > > >
> > > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > > >
> > > > but I don't think the bug was introduced by this commit, because there
> > > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > > write_classid(), which can be triggered by writing to ifpriomap or
> > > > classid in cgroupfs. This commit just made it much easier to happen
> > > > with systemd involved.
> > > >
> > > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> > >
> > > Good point.
> > >
> > > I took a deeper look; it looks like commit d979a39d7242e06
> > > is the one to blame, because it is the first commit that began to
> > > hold cgroup refcnt in cgroup_sk_alloc().
> >
> > I agree, it seems that the issue is not related to bpf and probably
> > can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > seems closer to the origin.
> 
> Yeah, I will update the Fixes tag and send V2.
> 
> >
> > Btw, based on the number of reported-by tags it seems that there was
> > a real issue which the patch is fixing. Maybe you'll add a couple of
> > words about how it reveals itself in real life?
> 
> I still have no idea how exactly this is triggered. According to the
> people who reported this bug, they just need to wait for some hours
> to trigger. So I am not sure what to add here, just the stack trace?

Yeah, stack trace is definitely useful. So at least if someone encounters the same
error in the future, they can search for the stacktrace and find the fix.

Thanks!


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-18 21:26         ` Roman Gushchin
@ 2020-06-18 22:45           ` Peter Geis
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Geis @ 2020-06-18 22:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cong Wang, Zefan Li, Linux Kernel Network Developers,
	Cameron Berkenpas, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Thu, Jun 18, 2020 at 5:26 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jun 18, 2020 at 02:09:43PM -0700, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > > > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > > > >
> > > > > Cc: Roman Gushchin <guro@fb.com>
> > > > >
> > > > > Thanks for fixing this.
> > > > >
> > > > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > > > even when cgroup_sk_alloc is disabled.
> > > > > >
> > > > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > > > would terminate this function if called there. And for sk_alloc()
> > > > > > skcd->val is always zero. So it's safe to factor out the code
> > > > > > to make it more readable.
> > > > > >
> > > > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > > > >
> > > > > but I don't think the bug was introduced by this commit, because there
> > > > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > > > write_classid(), which can be triggered by writing to ifpriomap or
> > > > > classid in cgroupfs. This commit just made it much easier to happen
> > > > > with systemd involved.
> > > > >
> > > > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> > > >
> > > > Good point.
> > > >
> > > > I took a deeper look; it looks like commit d979a39d7242e06
> > > > is the one to blame, because it is the first commit that began to
> > > > hold cgroup refcnt in cgroup_sk_alloc().
> > >
> > > I agree, it seems that the issue is not related to bpf and probably
> > > can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > > seems closer to the origin.
> >
> > Yeah, I will update the Fixes tag and send V2.
> >
> > >
> > > Btw, based on the number of reported-by tags it seems that there was
> > > a real issue which the patch is fixing. Maybe you'll add a couple of
> > > words about how it reveals itself in real life?
> >
> > I still have no idea how exactly this is triggered. According to the
> > people who reported this bug, they just need to wait for some hours
> > to trigger. So I am not sure what to add here, just the stack trace?
>
> Yeah, stack trace is definitely useful. So at least if someone encounters the same
> error in the future, they can search for the stacktrace and find the fix.

I can provide at least a little insight into my configuration that
allowed the bug to trigger.
I'm running the following configuration:
Ubuntu 20.04 - aarch64
Mainline 5.6.17 kernel using Ubuntu's config.
LXD with several containers active.
Docker with several containers active.

Every crash was the same, with nfsd triggering the sequence.
My only nfs client at the time was a Windows 10 device.
Disabling nfsd prevented the bug from occurring.

Here was the last stack trace:
[23352.431106] rockpro64 kernel: Unable to handle kernel read from
unreadable memory at virtual address 0000000000000010
[23352.431938] rockpro64 kernel: Mem abort info:
[23352.432199] rockpro64 kernel:   ESR = 0x96000004
[23352.432485] rockpro64 kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
[23352.432965] rockpro64 kernel:   SET = 0, FnV = 0
[23352.433248] rockpro64 kernel:   EA = 0, S1PTW = 0
[23352.433536] rockpro64 kernel: Data abort info:
[23352.433803] rockpro64 kernel:   ISV = 0, ISS = 0x00000004
[23352.434153] rockpro64 kernel:   CM = 0, WnR = 0
[23352.434475] rockpro64 kernel: user pgtable: 4k pages, 48-bit VAs,
pgdp=0000000094d4c000
[23352.435174] rockpro64 kernel: [0000000000000010]
pgd=0000000094f3d003, pud=00000000bdb7f003, pmd=0000000000000000
[23352.435963] rockpro64 kernel: Internal error: Oops: 96000004 [#1] SMP
[23352.436396] rockpro64 kernel: Modules linked in: xt_TCPMSS
nf_conntrack_netlink xfrm_user xt_addrtype br_netfilter ip_set_hash_ip
ip_set_hash_net xt_set ip_set cfg80211 nft_counter xt_length
xt_connmark xt_multiport xt_mark nf_log_ip>
[23352.436519] rockpro64 kernel:  ghash_ce enclosure snd_soc_es8316
scsi_transport_sas snd_seq_midi sha2_ce snd_seq_midi_event
snd_soc_simple_card snd_rawmidi snd_soc_audio_graph_card sha256_arm64
panfrost snd_soc_simple_card_utils sha1>
[23352.444216] rockpro64 kernel:  async_pq async_xor xor xor_neon
async_tx uas raid6_pq raid1 raid0 multipath linear usb_storage
xhci_plat_hcd dwc3 rtc_rk808 clk_rk808 rk808_regulator ulpi udc_core
fusb302 tcpm typec fan53555 rk808 pwm_>
[23352.455532] rockpro64 kernel: CPU: 3 PID: 1237 Comm: nfsd Not
tainted 5.6.17+ #74
[23352.456054] rockpro64 kernel: Hardware name: pine64
rockpro64_rk3399/rockpro64_rk3399, BIOS
2020.07-rc2-00124-g515f613253-dirty 05/19/2020
[23352.457010] rockpro64 kernel: pstate: 60400005 (nZCv daif +PAN -UAO)
[23352.457445] rockpro64 kernel: pc : __cgroup_bpf_run_filter_skb+0x2a8/0x400
[23352.457918] rockpro64 kernel: lr : ip_finish_output+0x98/0xd0
[23352.458287] rockpro64 kernel: sp : ffff80001325b900
[23352.458581] rockpro64 kernel: x29: ffff80001325b900 x28: ffff000012f0fae0
[23352.459051] rockpro64 kernel: x27: 0000000000000001 x26: ffff00005f0ddc00
[23352.459521] rockpro64 kernel: x25: 0000000000000118 x24: ffff0000dcd3c270
[23352.459990] rockpro64 kernel: x23: 0000000000000010 x22: ffff800011b1aec0
[23352.460458] rockpro64 kernel: x21: ffff0000efcacc40 x20: 0000000000000010
[23352.460928] rockpro64 kernel: x19: ffff0000dcd3bf00 x18: 0000000000000000
[23352.461396] rockpro64 kernel: x17: 0000000000000000 x16: 0000000000000000
[23352.461863] rockpro64 kernel: x15: 0000000000000000 x14: 0000000000000004
[23352.462332] rockpro64 kernel: x13: 0000000000000001 x12: 0000000000201400
[23352.462802] rockpro64 kernel: x11: 0000000000000000 x10: 0000000000000000
[23352.463271] rockpro64 kernel: x9 : ffff800010b6f6d0 x8 : 0000000000000260
[23352.463738] rockpro64 kernel: x7 : 0000000000000000 x6 : ffff0000dc12a000
[23352.464208] rockpro64 kernel: x5 : ffff0000dcd3bf00 x4 : 0000000000000028
[23352.464677] rockpro64 kernel: x3 : 0000000000000000 x2 : ffff000012f0fb08
[23352.465145] rockpro64 kernel: x1 : ffff00005f0ddd40 x0 : 0000000000000000
[23352.465616] rockpro64 kernel: Call trace:
[23352.465843] rockpro64 kernel:  __cgroup_bpf_run_filter_skb+0x2a8/0x400
[23352.466283] rockpro64 kernel:  ip_finish_output+0x98/0xd0
[23352.466625] rockpro64 kernel:  ip_output+0xb0/0x130
[23352.466920] rockpro64 kernel:  ip_local_out+0x4c/0x60
[23352.467233] rockpro64 kernel:  __ip_queue_xmit+0x128/0x380
[23352.467584] rockpro64 kernel:  ip_queue_xmit+0x10/0x18
[23352.467903] rockpro64 kernel:  __tcp_transmit_skb+0x470/0xaf0
[23352.468274] rockpro64 kernel:  tcp_write_xmit+0x39c/0x1110
[23352.468623] rockpro64 kernel:  __tcp_push_pending_frames+0x40/0x100
[23352.469040] rockpro64 kernel:  tcp_send_fin+0x6c/0x240
[23352.469358] rockpro64 kernel:  tcp_shutdown+0x60/0x68
[23352.469669] rockpro64 kernel:  inet_shutdown+0xb0/0x120
[23352.469997] rockpro64 kernel:  kernel_sock_shutdown+0x1c/0x28
[23352.470464] rockpro64 kernel:  svc_tcp_sock_detach+0xd0/0x110 [sunrpc]
[23352.470980] rockpro64 kernel:  svc_delete_xprt+0x74/0x240 [sunrpc]
[23352.471445] rockpro64 kernel:  svc_recv+0x45c/0xb10 [sunrpc]
[23352.471864] rockpro64 kernel:  nfsd+0xdc/0x150 [nfsd]
[23352.472179] rockpro64 kernel:  kthread+0xfc/0x128
[23352.472461] rockpro64 kernel:  ret_from_fork+0x10/0x18
[23352.472785] rockpro64 kernel: Code: 9100c0c6 17ffff7b f9431cc0
91004017 (f9400814)
[23352.473324] rockpro64 kernel: ---[ end trace 978df9e144fd1235 ]---
[29973.397069] rockpro64 kernel: Unable to handle kernel read from
unreadable memory at virtual address 0000000000000010
[29973.397966] rockpro64 kernel: Mem abort info:
[29973.398224] rockpro64 kernel:   ESR = 0x96000004
[29973.398503] rockpro64 kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
[29973.398976] rockpro64 kernel:   SET = 0, FnV = 0
[29973.399254] rockpro64 kernel:   EA = 0, S1PTW = 0
[29973.399537] rockpro64 kernel: Data abort info:
[29973.399799] rockpro64 kernel:   ISV = 0, ISS = 0x00000004
[29973.400143] rockpro64 kernel:   CM = 0, WnR = 0
[29973.400416] rockpro64 kernel: user pgtable: 4k pages, 48-bit VAs,
pgdp=00000000dcdd1000
[29973.400989] rockpro64 kernel: [0000000000000010] pgd=0000000000000000
[29973.401490] rockpro64 kernel: Internal error: Oops: 96000004 [#2] SMP

I hope this helps.

>
> Thanks!


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-18 21:09       ` Cong Wang
  2020-06-18 21:26         ` Roman Gushchin
@ 2020-06-19  6:40         ` Zefan Li
  2020-06-19 19:51           ` Cong Wang
  2020-06-20  0:51           ` Roman Gushchin
  1 sibling, 2 replies; 41+ messages in thread
From: Zefan Li @ 2020-06-19  6:40 UTC (permalink / raw)
  To: Cong Wang, Roman Gushchin
  Cc: Linux Kernel Network Developers, Cameron Berkenpas, Peter Geis,
	Lu Fengqi, Daniël Sonck, Daniel Borkmann, Tejun Heo

On 2020/6/19 5:09, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>
>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>
>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>
>>>> Thanks for fixing this.
>>>>
>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>> even when cgroup_sk_alloc is disabled.
>>>>>
>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>> would terminate this function if called there. And for sk_alloc()
>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>> to make it more readable.
>>>>>
>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>
>>>> but I don't think the bug was introduced by this commit, because there
>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>> with systemd involved.
>>>>
>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>
>>> Good point.
>>>
>>> I took a deeper look; it looks like commit d979a39d7242e06
>>> is the one to blame, because it is the first commit that began to
>>> hold cgroup refcnt in cgroup_sk_alloc().
>>
>> I agree, it seems that the issue is not related to bpf and probably
>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>> seems closer to the origin.
> 
> Yeah, I will update the Fixes tag and send V2.
> 

Commit d979a39d7242e06 looks innocent to me. With this commit, when cgroup_sk_alloc
is disabled and a socket is then cloned, the cgroup refcnt will not be incremented,
but this is fine, because when the socket is freed:

 sk_prot_free()
   cgroup_sk_free()
     cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)

cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.

but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt was not incremented,
as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
that needs to be fixed.

>>
>> Btw, based on the number of reported-by tags it seems that there was
>> a real issue which the patch is fixing. Maybe you'll add a couple of
>> words about how it reveals itself in real life?
> 
> I still have no idea how exactly this is triggered. According to the
> people who reported this bug, they just need to wait for some hours
> to trigger. So I am not sure what to add here, just the stack trace?
> 
> Thanks.
> .
> 



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-19  6:40         ` Zefan Li
@ 2020-06-19 19:51           ` Cong Wang
  2020-06-20  0:45             ` Zefan Li
  2020-06-20  0:51           ` Roman Gushchin
  1 sibling, 1 reply; 41+ messages in thread
From: Cong Wang @ 2020-06-19 19:51 UTC (permalink / raw)
  To: Zefan Li
  Cc: Roman Gushchin, Linux Kernel Network Developers,
	Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo

On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>
> On 2020/6/19 5:09, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>
> >> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>
> >>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>
> >>>> Thanks for fixing this.
> >>>>
> >>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>> even when cgroup_sk_alloc is disabled.
> >>>>>
> >>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>> would terminate this function if called there. And for sk_alloc()
> >>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>> to make it more readable.
> >>>>>
> >>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>
> >>>> but I don't think the bug was introduced by this commit, because there
> >>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>> with systemd involved.
> >>>>
> >>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>
> >>> Good point.
> >>>
> >>> I took a deeper look; it looks like commit d979a39d7242e06
> >>> is the one to blame, because it is the first commit that began to
> >>> hold cgroup refcnt in cgroup_sk_alloc().
> >>
> >> I agree, ut seems that the issue is not related to bpf and probably
> >> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >> seems closer to the origin.
> >
> > Yeah, I will update the Fixes tag and send V2.
> >
>
> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> but this is fine, because when the socket is to be freed:
>
>  sk_prot_free()
>    cgroup_sk_free()
>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>
> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.

But skcd->val can be a pointer to a non-root cgroup:

static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
        unsigned long v;

        /*
         * @skcd->val is 64bit but the following is safe on 32bit too as we
         * just need the lower ulong to be written and read atomically.
         */
        v = READ_ONCE(skcd->val);

        if (v & 1)
                return &cgrp_dfl_root.cgrp;

        return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
#else
        return (struct cgroup *)(unsigned long)skcd->val;
#endif
}

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-19 19:51           ` Cong Wang
@ 2020-06-20  0:45             ` Zefan Li
  2020-06-20  0:51               ` Zefan Li
  0 siblings, 1 reply; 41+ messages in thread
From: Zefan Li @ 2020-06-20  0:45 UTC (permalink / raw)
  To: Cong Wang
  Cc: Roman Gushchin, Linux Kernel Network Developers,
	Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo

On 2020/6/20 3:51, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>
>> On 2020/6/19 5:09, Cong Wang wrote:
>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>
>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>
>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>
>>>>>> Thanks for fixing this.
>>>>>>
>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>
>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>> to make it more readable.
>>>>>>>
>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>
>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>> with systemd invovled.
>>>>>>
>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>
>>>>> Good point.
>>>>>
>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>> is the one to blame, because it is the first commit that began to
>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>
>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>> seems closer to the origin.
>>>
>>> Yeah, I will update the Fixes tag and send V2.
>>>
>>
>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>> but this is fine, because when the socket is to be freed:
>>
>>  sk_prot_free()
>>    cgroup_sk_free()
>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>
>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> 
> But skcd->val can be a pointer to a non-root cgroup:

It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
when cgroup_sk_alloc is disabled.

> 
> static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
> {
> #if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
>         unsigned long v;
> 
>         /*
>          * @skcd->val is 64bit but the following is safe on 32bit too as we
>          * just need the lower ulong to be written and read atomically.
>          */
>         v = READ_ONCE(skcd->val);
> 
>         if (v & 1)
>                 return &cgrp_dfl_root.cgrp;
> 
>         return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> #else
>         return (struct cgroup *)(unsigned long)skcd->val;
> #endif
> }
> .
> 



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-19  6:40         ` Zefan Li
  2020-06-19 19:51           ` Cong Wang
@ 2020-06-20  0:51           ` Roman Gushchin
  2020-06-20  1:00             ` Zefan Li
  1 sibling, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-06-20  0:51 UTC (permalink / raw)
  To: Zefan Li
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
> On 2020/6/19 5:09, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>
> >> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>
> >>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>
> >>>> Thanks for fixing this.
> >>>>
> >>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>> even when cgroup_sk_alloc is disabled.
> >>>>>
> >>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>> would terminate this function if called there. And for sk_alloc()
> >>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>> to make it more readable.
> >>>>>
> >>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>
> >>>> but I don't think the bug was introduced by this commit, because there
> >>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>> with systemd invovled.
> >>>>
> >>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>
> >>> Good point.
> >>>
> >>> I take a deeper look, it looks like commit d979a39d7242e06
> >>> is the one to blame, because it is the first commit that began to
> >>> hold cgroup refcnt in cgroup_sk_alloc().
> >>
> >> I agree, ut seems that the issue is not related to bpf and probably
> >> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >> seems closer to the origin.
> > 
> > Yeah, I will update the Fixes tag and send V2.
> > 
> 
> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> but this is fine, because when the socket is to be freed:
> 
>  sk_prot_free()
>    cgroup_sk_free()
>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> 
> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> 
> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
> that needs to be fixed.

Hm, does it mean that the problem always happens with the root cgroup?

From the stacktrace provided by Peter it looks like the problem
is bpf-related, but the original patch says nothing about it.

So from the test above it sounds like the problem is that we're trying
to release root's cgroup_bpf, which is a bad idea, I totally agree.
Is this the problem? If so, we might wanna fix it in a different way,
just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
like in cgroup_put(). It feels more reliable to me.

Thanks!



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  0:45             ` Zefan Li
@ 2020-06-20  0:51               ` Zefan Li
  2020-06-20  3:31                 ` Cong Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Zefan Li @ 2020-06-20  0:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: Roman Gushchin, Linux Kernel Network Developers,
	Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo

On 2020/6/20 8:45, Zefan Li wrote:
> On 2020/6/20 3:51, Cong Wang wrote:
>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>>
>>> On 2020/6/19 5:09, Cong Wang wrote:
>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>>
>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>>
>>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>>
>>>>>>> Thanks for fixing this.
>>>>>>>
>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>>
>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>>> to make it more readable.
>>>>>>>>
>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>>
>>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>>> with systemd invovled.
>>>>>>>
>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>>
>>>>>> Good point.
>>>>>>
>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>>> is the one to blame, because it is the first commit that began to
>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>>
>>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>>> seems closer to the origin.
>>>>
>>>> Yeah, I will update the Fixes tag and send V2.
>>>>
>>>
>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>>> but this is fine, because when the socket is to be freed:
>>>
>>>  sk_prot_free()
>>>    cgroup_sk_free()
>>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>>
>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>
>> But skcd->val can be a pointer to a non-root cgroup:
> 
> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> when cgroup_sk_alloc is disabled.
> 

And please read those recent bug reports: they all happened when bpf cgroup was in use,
and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  0:51           ` Roman Gushchin
@ 2020-06-20  1:00             ` Zefan Li
  2020-06-20  1:14               ` Roman Gushchin
  0 siblings, 1 reply; 41+ messages in thread
From: Zefan Li @ 2020-06-20  1:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On 2020/6/20 8:51, Roman Gushchin wrote:
> On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
>> On 2020/6/19 5:09, Cong Wang wrote:
>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>
>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>
>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>
>>>>>> Thanks for fixing this.
>>>>>>
>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>
>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>> to make it more readable.
>>>>>>>
>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>
>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>> with systemd invovled.
>>>>>>
>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>
>>>>> Good point.
>>>>>
>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>> is the one to blame, because it is the first commit that began to
>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>
>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>> seems closer to the origin.
>>>
>>> Yeah, I will update the Fixes tag and send V2.
>>>
>>
>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>> but this is fine, because when the socket is to be freed:
>>
>>  sk_prot_free()
>>    cgroup_sk_free()
>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>
>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>
>> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
>> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
>> that needs to be fixed.
> 
> Hm, does it mean that the problem always happens with the root cgroup?
> 
> From the stacktrace provided by Peter it looks like the problem
> is bpf-related, but the original patch says nothing about it.
> 
> So from the test above it sounds like the problem is that we're trying
> to release root's cgroup_bpf, which is a bad idea, I totally agree.
> Is this the problem?

I think so, though I'm not familiar with the bpf cgroup code.

> If so, we might wanna fix it in a different way,
> just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> like in cgroup_put(). It feels more reliable to me.
> 

Yeah I also have this idea in my mind.


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  1:00             ` Zefan Li
@ 2020-06-20  1:14               ` Roman Gushchin
  2020-06-20  2:48                 ` Zefan Li
  2020-06-20  3:00                 ` Cong Wang
  0 siblings, 2 replies; 41+ messages in thread
From: Roman Gushchin @ 2020-06-20  1:14 UTC (permalink / raw)
  To: Zefan Li
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> On 2020/6/20 8:51, Roman Gushchin wrote:
> > On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
> >> On 2020/6/19 5:09, Cong Wang wrote:
> >>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>>>
> >>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>>
> >>>>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>>>
> >>>>>> Thanks for fixing this.
> >>>>>>
> >>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>
> >>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>> to make it more readable.
> >>>>>>>
> >>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>>>
> >>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>> with systemd invovled.
> >>>>>>
> >>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>
> >>>>> Good point.
> >>>>>
> >>>>> I take a deeper look, it looks like commit d979a39d7242e06
> >>>>> is the one to blame, because it is the first commit that began to
> >>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>
> >>>> I agree, ut seems that the issue is not related to bpf and probably
> >>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>> seems closer to the origin.
> >>>
> >>> Yeah, I will update the Fixes tag and send V2.
> >>>
> >>
> >> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> >> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> >> but this is fine, because when the socket is to be freed:
> >>
> >>  sk_prot_free()
> >>    cgroup_sk_free()
> >>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>
> >> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> >>
> >> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
> >> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
> >> that needs to be fixed.
> > 
> > Hm, does it mean that the problem always happens with the root cgroup?
> > 
> > From the stacktrace provided by Peter it looks like the problem
> > is bpf-related, but the original patch says nothing about it.
> > 
> > So from the test above it sounds like the problem is that we're trying
> > to release root's cgroup_bpf, which is a bad idea, I totally agree.
> > Is this the problem?
> 
> I think so, though I'm not familiar with the bpf cgroup code.
> 
> > If so, we might wanna fix it in a different way,
> > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > like in cgroup_put(). It feels more reliable to me.
> > 
> 
> Yeah I also have this idea in my mind.

I wonder if the following patch will fix the issue?

--

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..7eb51137d896 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -942,12 +942,14 @@ static inline bool cgroup_task_frozen(struct task_struct *task)
 #ifdef CONFIG_CGROUP_BPF
 static inline void cgroup_bpf_get(struct cgroup *cgrp)
 {
-       percpu_ref_get(&cgrp->bpf.refcnt);
+       if (!(cgrp->self.flags & CSS_NO_REF))
+               percpu_ref_get(&cgrp->bpf.refcnt);
 }
 
 static inline void cgroup_bpf_put(struct cgroup *cgrp)
 {
-       percpu_ref_put(&cgrp->bpf.refcnt);
+       if (!(cgrp->self.flags & CSS_NO_REF))
+               percpu_ref_put(&cgrp->bpf.refcnt);
 }
 
 #else /* CONFIG_CGROUP_BPF */


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  1:14               ` Roman Gushchin
@ 2020-06-20  2:48                 ` Zefan Li
  2020-06-20  3:00                 ` Cong Wang
  1 sibling, 0 replies; 41+ messages in thread
From: Zefan Li @ 2020-06-20  2:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

>>> If so, we might wanna fix it in a different way,
>>> just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
>>> like in cgroup_put(). It feels more reliable to me.
>>>
>>
>> Yeah I also have this idea in my mind.
> 
> I wonder if the following patch will fix the issue?
> 

I guess so, but it would be better to have one of the people who reported
this bug test it.

> --
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4598e4da6b1b..7eb51137d896 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -942,12 +942,14 @@ static inline bool cgroup_task_frozen(struct task_struct *task)
>  #ifdef CONFIG_CGROUP_BPF
>  static inline void cgroup_bpf_get(struct cgroup *cgrp)
>  {
> -       percpu_ref_get(&cgrp->bpf.refcnt);
> +       if (!(cgrp->self.flags & CSS_NO_REF))
> +               percpu_ref_get(&cgrp->bpf.refcnt);
>  }
>  
>  static inline void cgroup_bpf_put(struct cgroup *cgrp)
>  {
> -       percpu_ref_put(&cgrp->bpf.refcnt);
> +       if (!(cgrp->self.flags & CSS_NO_REF))
> +               percpu_ref_put(&cgrp->bpf.refcnt);
>  }
>  
>  #else /* CONFIG_CGROUP_BPF */
> 



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  1:14               ` Roman Gushchin
  2020-06-20  2:48                 ` Zefan Li
@ 2020-06-20  3:00                 ` Cong Wang
  2020-06-20 15:57                   ` Roman Gushchin
  1 sibling, 1 reply; 41+ messages in thread
From: Cong Wang @ 2020-06-20  3:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > I think so, though I'm not familiar with the bfp cgroup code.
> >
> > > If so, we might wanna fix it in a different way,
> > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > like in cgroup_put(). It feels more reliable to me.
> > >
> >
> > Yeah I also have this idea in my mind.
>
> I wonder if the following patch will fix the issue?

Interesting, AFAIU, this refcnt is for bpf programs attached
to the cgroup. By this suggestion, do you mean the root
cgroup does not need to refcnt the bpf programs attached
to it? This seems odd, as I don't see how root is different
from others in terms of bpf programs which can be attached
and detached in the same way.

I certainly understand that the root cgroup is never gone, but that
does not mean the bpf programs attached to it are never gone too.

What am I missing?

Thanks.


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  0:51               ` Zefan Li
@ 2020-06-20  3:31                 ` Cong Wang
  2020-06-20  7:52                   ` Zefan Li
  2020-06-23 22:21                   ` Roman Gushchin
  0 siblings, 2 replies; 41+ messages in thread
From: Cong Wang @ 2020-06-20  3:31 UTC (permalink / raw)
  To: Zefan Li
  Cc: Roman Gushchin, Linux Kernel Network Developers,
	Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo

On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
>
> On 2020/6/20 8:45, Zefan Li wrote:
> > On 2020/6/20 3:51, Cong Wang wrote:
> >> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>
> >>> On 2020/6/19 5:09, Cong Wang wrote:
> >>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>>>>
> >>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>>>
> >>>>>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>>>>
> >>>>>>> Thanks for fixing this.
> >>>>>>>
> >>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>>
> >>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>>> to make it more readable.
> >>>>>>>>
> >>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>>>>
> >>>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>>> with systemd invovled.
> >>>>>>>
> >>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>>
> >>>>>> Good point.
> >>>>>>
> >>>>>> I take a deeper look, it looks like commit d979a39d7242e06
> >>>>>> is the one to blame, because it is the first commit that began to
> >>>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>>
> >>>>> I agree, ut seems that the issue is not related to bpf and probably
> >>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>>> seems closer to the origin.
> >>>>
> >>>> Yeah, I will update the Fixes tag and send V2.
> >>>>
> >>>
> >>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> >>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> >>> but this is fine, because when the socket is to be freed:
> >>>
> >>>  sk_prot_free()
> >>>    cgroup_sk_free()
> >>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>>
> >>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> >>
> >> But skcd->val can be a pointer to a non-root cgroup:
> >
> > It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> > when cgroup_sk_alloc is disabled.
> >
>
> And please read those recent bug reports, they all happened when bpf cgroup was in use,
> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.

I am totally aware of this. My concern is whether the cgroup refcnt
has the same bug, as it always pairs with the bpf refcnt.

But, after a second look, the non-root cgroup pointer is immediately
overwritten by sock_update_classid() or sock_update_netprioidx(),
which effectively turns it into a root cgroup again. :-/

(It seems we leak a refcnt here, but this is not related to my patch).

Thanks.


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  3:31                 ` Cong Wang
@ 2020-06-20  7:52                   ` Zefan Li
  2020-06-20 16:04                     ` Roman Gushchin
  2020-06-23 22:21                   ` Roman Gushchin
  1 sibling, 1 reply; 41+ messages in thread
From: Zefan Li @ 2020-06-20  7:52 UTC (permalink / raw)
  To: Cong Wang
  Cc: Roman Gushchin, Linux Kernel Network Developers,
	Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo

On 2020/6/20 11:31, Cong Wang wrote:
> On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
>>
>> On 2020/6/20 8:45, Zefan Li wrote:
>>> On 2020/6/20 3:51, Cong Wang wrote:
>>>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>
>>>>> On 2020/6/19 5:09, Cong Wang wrote:
>>>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>>>>
>>>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>>>>
>>>>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>>>>
>>>>>>>>> Thanks for fixing this.
>>>>>>>>>
>>>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>>>>
>>>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>>>>> to make it more readable.
>>>>>>>>>>
>>>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>>>>
>>>>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>>>>> with systemd invovled.
>>>>>>>>>
>>>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>>>>
>>>>>>>> Good point.
>>>>>>>>
>>>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>>>>> is the one to blame, because it is the first commit that began to
>>>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>>>>
>>>>>>> I agree, it seems that the issue is not related to bpf and probably
>>>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>>>>> seems closer to the origin.
>>>>>>
>>>>>> Yeah, I will update the Fixes tag and send V2.
>>>>>>
>>>>>
>>>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>>>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>>>>> but this is fine, because when the socket is to be freed:
>>>>>
>>>>>  sk_prot_free()
>>>>>    cgroup_sk_free()
>>>>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>>>>
>>>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>>>
>>>> But skcd->val can be a pointer to a non-root cgroup:
>>>
>>> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
>>> when cgroup_sk_alloc is disabled.
>>>
>>
>> And please read those recent bug reports, they all happened when bpf cgroup was in use,
>> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
> 
> I am totally aware of this. My concern is whether cgroup
> has the same refcnt bug as it always pairs with the bpf refcnt.
> 
> But, after a second look, the non-root cgroup refcnt is immediately
> overwritten by sock_update_classid() or sock_update_netprioidx(),
> which effectively turns into a root cgroup again. :-/
> 
> (It seems we leak a refcnt here, but this is not related to my patch).
> 

Indeed, but it's well known, see bd1060a1d67128bb8fbe2. But now bpf cgroup comes into play...

Your patch doesn't seem to fix the bug completely. If cgroup_sk_alloc_disable happens after
socket cloning, then we will deref the bpf of the root cgroup while having incref'd the bpf
of a non-root cgroup.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  3:00                 ` Cong Wang
@ 2020-06-20 15:57                   ` Roman Gushchin
  2020-06-22 18:14                     ` Cong Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-06-20 15:57 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
> On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > > I think so, though I'm not familiar with the bpf cgroup code.
> > >
> > > > If so, we might wanna fix it in a different way,
> > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > > like in cgroup_put(). It feels more reliable to me.
> > > >
> > >
> > > Yeah I also have this idea in my mind.
> >
> > I wonder if the following patch will fix the issue?
> 
> Interesting, AFAIU, this refcnt is for bpf programs attached
> to the cgroup. By this suggestion, do you mean the root
> cgroup does not need to refcnt the bpf programs attached
> to it? This seems odd, as I don't see how root is different
> from others in terms of bpf programs which can be attached
> and detached in the same way.
> 
> I certainly understand the root cgroup is never gone, but this
> does not mean the bpf programs attached to it too.
> 
> What am I missing?

It's different because the root cgroup can't be deleted.

All this reference counting is required to automatically detach bpf programs
from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
because a cgroup can be in dying state for a long time being pinned by a
pagecache page, for example. Only a user can detach a bpf program from
an existing cgroup.

Thanks!


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  7:52                   ` Zefan Li
@ 2020-06-20 16:04                     ` Roman Gushchin
  0 siblings, 0 replies; 41+ messages in thread
From: Roman Gushchin @ 2020-06-20 16:04 UTC (permalink / raw)
  To: Zefan Li
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Sat, Jun 20, 2020 at 03:52:49PM +0800, Zefan Li wrote:
> On 2020/6/20 11:31, Cong Wang wrote:
> > On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
> >>
> >> On 2020/6/20 8:45, Zefan Li wrote:
> >>> On 2020/6/20 3:51, Cong Wang wrote:
> >>>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>
> >>>>> On 2020/6/19 5:09, Cong Wang wrote:
> >>>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>>>>>>
> >>>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>>>>>>
> >>>>>>>>> Thanks for fixing this.
> >>>>>>>>>
> >>>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>>>>
> >>>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>>>>> to make it more readable.
> >>>>>>>>>>
> >>>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>>>>>>
> >>>>>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>>>>> with systemd involved.
> >>>>>>>>>
> >>>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>>>>
> >>>>>>>> Good point.
> >>>>>>>>
> >>>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
> >>>>>>>> is the one to blame, because it is the first commit that began to
> >>>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>>>>
> >>>>>>> I agree, it seems that the issue is not related to bpf and probably
> >>>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>>>>> seems closer to the origin.
> >>>>>>
> >>>>>> Yeah, I will update the Fixes tag and send V2.
> >>>>>>
> >>>>>
> >>>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> >>>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> >>>>> but this is fine, because when the socket is to be freed:
> >>>>>
> >>>>>  sk_prot_free()
> >>>>>    cgroup_sk_free()
> >>>>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>>>>
> >>>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> >>>>
> >>>> But skcd->val can be a pointer to a non-root cgroup:
> >>>
> >>> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> >>> when cgroup_sk_alloc is disabled.
> >>>
> >>
> >> And please read those recent bug reports, they all happened when bpf cgroup was in use,
> >> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
> > 
> > I am totally aware of this. My concern is whether cgroup
> > has the same refcnt bug as it always pairs with the bpf refcnt.
> > 
> > But, after a second look, the non-root cgroup refcnt is immediately
> > overwritten by sock_update_classid() or sock_update_netprioidx(),
> > which effectively turns into a root cgroup again. :-/
> > 
> > (It seems we leak a refcnt here, but this is not related to my patch).
> > 
> 
> Indeed, but it's well known, see bd1060a1d67128bb8fbe2. But now bpf cgroup comes into play...
> 
> Your patch doesn't seem to fix the bug completely. If cgroup_sk_alloc_disable happens after
> socket cloning, then we will deref the bpf of the root cgroup while having incref'd the bpf
> of a non-root cgroup.

Hm, so it seems that disabling the refcounting for the root cgroup's cgroup_bpf
is a good idea anyway, as it makes cgroup_bpf refcounting work similarly to cgroup
refcounting. But it's not completely clear to me whether that's enough, or whether
we have something else to fix here?

Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20 15:57                   ` Roman Gushchin
@ 2020-06-22 18:14                     ` Cong Wang
  2020-06-22 20:39                       ` Roman Gushchin
  0 siblings, 1 reply; 41+ messages in thread
From: Cong Wang @ 2020-06-22 18:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
> > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > > > I think so, though I'm not familiar with the bpf cgroup code.
> > > >
> > > > > If so, we might wanna fix it in a different way,
> > > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > > > like in cgroup_put(). It feels more reliable to me.
> > > > >
> > > >
> > > > Yeah I also have this idea in my mind.
> > >
> > > I wonder if the following patch will fix the issue?
> >
> > Interesting, AFAIU, this refcnt is for bpf programs attached
> > to the cgroup. By this suggestion, do you mean the root
> > cgroup does not need to refcnt the bpf programs attached
> > to it? This seems odd, as I don't see how root is different
> > from others in terms of bpf programs which can be attached
> > and detached in the same way.
> >
> > I certainly understand the root cgroup is never gone, but this
> > does not mean the bpf programs attached to it too.
> >
> > What am I missing?
>
> It's different because the root cgroup can't be deleted.
>
> All this reference counting is required to automatically detach bpf programs
> from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
> because a cgroup can be in dying state for a long time being pinned by a
> pagecache page, for example. Only a user can detach a bpf program from
> an existing cgroup.

Yeah, but users can still detach the bpf programs from root cgroup.
IIUC, after detaching, the pointer in the bpf array will be empty_prog_array
which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
deref it without checking NULL (as check_non_null == false).

This matches the 0000000000000010 pointer seen in the bug reports,
the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
So looks like we have to add a NULL check there regardless of refcnt.

Also, I am not sure whether your suggested patch makes a difference
for percpu refcnt, as percpu_ref_put() will never call ->release() until
percpu_ref_kill(), which is never called on root cgroup?

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-22 18:14                     ` Cong Wang
@ 2020-06-22 20:39                       ` Roman Gushchin
  2020-06-23  8:45                         ` Zhang,Qiang
                                           ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Roman Gushchin @ 2020-06-22 20:39 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Mon, Jun 22, 2020 at 11:14:20AM -0700, Cong Wang wrote:
> On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
> > > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > > > > I think so, though I'm not familiar with the bpf cgroup code.
> > > > >
> > > > > > If so, we might wanna fix it in a different way,
> > > > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > > > > like in cgroup_put(). It feels more reliable to me.
> > > > > >
> > > > >
> > > > > Yeah I also have this idea in my mind.
> > > >
> > > > I wonder if the following patch will fix the issue?
> > >
> > > Interesting, AFAIU, this refcnt is for bpf programs attached
> > > to the cgroup. By this suggestion, do you mean the root
> > > cgroup does not need to refcnt the bpf programs attached
> > > to it? This seems odd, as I don't see how root is different
> > > from others in terms of bpf programs which can be attached
> > > and detached in the same way.
> > >
> > > I certainly understand the root cgroup is never gone, but this
> > > does not mean the bpf programs attached to it too.
> > >
> > > What am I missing?
> >
> > It's different because the root cgroup can't be deleted.
> >
> > All this reference counting is required to automatically detach bpf programs
> > from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
> > because a cgroup can be in dying state for a long time being pinned by a
> > pagecache page, for example. Only a user can detach a bpf program from
> > an existing cgroup.
> 
> Yeah, but users can still detach the bpf programs from root cgroup.
> IIUC, after detaching, the pointer in the bpf array will be empty_prog_array
> which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
> deref it without checking NULL (as check_non_null == false).
> 
> This matches the 0000000000000010 pointer seen in the bug reports,
> the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
> So looks like we have to add a NULL check there regardless of refcnt.
> 
> Also, I am not sure whether your suggested patch makes a difference
> for percpu refcnt, as percpu_ref_put() will never call ->release() until
> percpu_ref_kill(), which is never called on root cgroup?

Hm, true. But it means that the problem is not with the root cgroup's bpf?

How easy is it to reproduce the problem? Is it possible to bisect the problematic
commit?

Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-22 20:39                       ` Roman Gushchin
@ 2020-06-23  8:45                         ` Zhang,Qiang
  2020-06-23 17:56                           ` Cong Wang
  2020-06-23  8:54                         ` Zhang,Qiang
  2020-06-23  9:01                         ` Zhang,Qiang
  2 siblings, 1 reply; 41+ messages in thread
From: Zhang,Qiang @ 2020-06-23  8:45 UTC (permalink / raw)
  To: guro
  Cc: cam@neo-zeon.de, lizefan@huawei.com, pgwipeout@gmail.com,
	xiyou.wangcong, daniel@iogearbox.net, tj@kernel.org, netdev,
	dsonck92, lufq.fnst

There are some messages in kernel v5.4; I don't know if they will help.

dmesg:

cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or 
net_cls activation
IPv6: ADDRCONF(NETDEV_CHANGE): veth4c31d8d2: link becomes ready
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered disabled state
device veth4c31d8d2 entered promiscuous mode
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered forwarding state
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev cni0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv6: ADDRCONF(NETDEV_CHANGE): vethb556dc7b: link becomes ready
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered disabled state
device vethb556dc7b entered promiscuous mode
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered forwarding state
IPv4: martian source 10.244.1.3 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
-----------[ cut here ]-----------
percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161 
percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Modules linked in: ipt_REJECT nf_reject_ipv4 vxlan ip6_udp_tunnel 
udp_tunnel xt_statistic xt_nat xt_tcpudp iptable_mangle xt_comment 
xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user 
iptable_nat xt_addrtype iptable_filter ip_tables xt_conntrack x_tables 
br_netfilter bridge stp llc bnep iTCO_wdt iTCO_vendor_support watchdog 
intel_powerclamp gpio_ich mgag200 drm_vram_helper drm_ttm_helper ttm 
i2c_i801 coretemp lpc_ich acpi_cpufreq sch_fq_codel openvswitch nsh 
nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G I 5.7.0-yoctodev-standard #1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Code: 80 3d b1 42 3b 01 00 0f 85 56 ff ff ff 49 8b 54 24 d8 48 c7 c7 68 
57 1d a9 c6 05 98 42 3b 01 01 49 8b 74 24 e8 e8 ea 14 aa ff <0f> 0b e9 
32 ff ff ff 0f 0b eb 97 cc cc cc cc cc cc cc cc cc cc cc
RSP: 0018:ffff996183268e90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 7ffffffffffffff3 RCX: 0000000000000000
RDX: 0000000000000102 RSI: ffffffffa9794aa7 RDI: 00000000ffffffff
RBP: ffff996183268ea8 R08: ffffffffa9794a60 R09: 0000000000000047
R10: 0000000080000001 R11: ffffffffa9794a8c R12: ffff95855904fef0
R13: 000023dc1ba33080 R14: ffff996183268ee0 R15: ffff95855904fef0
FS: 0000000000000000(0000) GS:ffff958563c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f62622f8658 CR3: 000000044fc0a000 CR4: 00000000000006e0

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-22 20:39                       ` Roman Gushchin
  2020-06-23  8:45                         ` Zhang,Qiang
@ 2020-06-23  8:54                         ` Zhang,Qiang
  2020-06-23  9:01                         ` Zhang,Qiang
  2 siblings, 0 replies; 41+ messages in thread
From: Zhang,Qiang @ 2020-06-23  8:54 UTC (permalink / raw)
  To: guro
  Cc: lizefan, xiyou.wangcong@gmail.com, netdev, lufq.fnst,
	pgwipeout, tj



The tester found the following information during the test.

The dmesg information is as follows (kernel v5.4); I don't know if it
helps with this question.


root@intel-x86-64:~# cgroup: cgroup: disabling cgroup2 socket matching 
due to net_prio or net_cls activation
IPv6: ADDRCONF(NETDEV_CHANGE): veth4c31d8d2: link becomes ready
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered disabled state
device veth4c31d8d2 entered promiscuous mode
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered forwarding state
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev cni0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv6: ADDRCONF(NETDEV_CHANGE): vethb556dc7b: link becomes ready
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered disabled state
device vethb556dc7b entered promiscuous mode
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered forwarding state
IPv4: martian source 10.244.1.3 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
-----------[ cut here ]-----------
percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161 
percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Modules linked in: ipt_REJECT nf_reject_ipv4 vxlan ip6_udp_tunnel 
udp_tunnel xt_statistic xt_nat xt_tcpudp iptable_mangle xt_comment 
xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user 
iptable_nat xt_addrtype iptable_filter ip_tables xt_conntrack x_tables 
br_netfilter bridge stp llc bnep iTCO_wdt iTCO_vendor_support watchdog 
intel_powerclamp gpio_ich mgag200 drm_vram_helper drm_ttm_helper ttm 
i2c_i801 coretemp lpc_ich acpi_cpufreq sch_fq_codel openvswitch nsh 
nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G I 5.7.0-yoctodev-standard #1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Code: 80 3d b1 42 3b 01 00 0f 85 56 ff ff ff 49 8b 54 24 d8 48 c7 c7 68 
57 1d a9 c6 05 98 42 3b 01 01 49 8b 74 24 e8 e8 ea 14 aa ff <0f> 0b e9 
32 ff ff ff 0f 0b eb 97 cc cc cc cc cc cc cc cc cc cc cc
RSP: 0018:ffff996183268e90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 7ffffffffffffff3 RCX: 0000000000000000
RDX: 0000000000000102 RSI: ffffffffa9794aa7 RDI: 00000000ffffffff
RBP: ffff996183268ea8 R08: ffffffffa9794a60 R09: 0000000000000047
R10: 0000000080000001 R11: ffffffffa9794a8c R12: ffff95855904fef0
R13: 000023dc1ba33080 R14: ffff996183268ee0 R15: ffff95855904fef0
FS: 0000000000000000(0000) GS:ffff958563c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f62622f8658 CR3: 000000044fc0a000 CR4: 00000000000006e0
Call Trace:
<IRQ>
rcu_core+0x227/0x870
? timerqueue_add+0x68/0xa0
rcu_core_si+0xe/0x10
__do_softirq+0x102/0x358
? tick_program_event+0x4d/0x90
irq_exit+0xa0/0x110
smp_apic_timer_interrupt+0xa1/0x1b0
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:cpuidle_enter_state+0xc0/0x3c0
Code: 85 c0 0f 8f 28 02 00 00 31 ff e8 5b 1a 5f ff 45 84 ff 74 12 9c 58 
f6 c4 02 0f 85 d0 02 00 00 31 ff e8 34 ba 65 ff fb 45 85 e4 <0f> 88 cc 
00 00 00 49 63 cc 4c 2b 75 d0 48 6b c1 68 48 6b d1 38 48
RSP: 0018:ffff9961831c3e48 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: ffff958563c00000 RBX: ffffffffa9515e60 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000003286e833 RDI: 0000000000000000
RBP: ffff9961831c3e88 R08: 0000000000000002 R09: 0000000000000018
R10: 0000000000000364 R11: ffff958563c2a284 R12: 0000000000000003
R13: ffffb9617fc3fd00 R14: 00000194af870de5 R15: 0000000000000000
? cpuidle_enter_state+0xa5/0x3c0
cpuidle_enter+0x2e/0x40
call_cpuidle+0x23/0x40
do_idle+0x1c6/0x240
cpu_startup_entry+0x20/0x30
start_secondary+0x15b/0x190
secondary_startup_64+0xb6/0xc0
--[ end trace 805031dac04b28f5 ]--

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-22 20:39                       ` Roman Gushchin
  2020-06-23  8:45                         ` Zhang,Qiang
  2020-06-23  8:54                         ` Zhang,Qiang
@ 2020-06-23  9:01                         ` Zhang,Qiang
  2 siblings, 0 replies; 41+ messages in thread
From: Zhang,Qiang @ 2020-06-23  9:01 UTC (permalink / raw)
  To: guro
  Cc: cam@neo-zeon.de, lizefan, daniel, dsonck92, lufq.fnst,
	netdev, pgwipeout, tj, xiyou.wangcong

On Mon, Jun 22, 2020 at 11:14:20AM -0700, Cong Wang wrote:
 > On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin <guro@fb.com> wrote:
 > >
 > > On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
 > > > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
 > > > >
 > > > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
 > > > > > I think so, though I'm not familiar with the bpf cgroup code.
 > > > > >
 > > > > > > If so, we might wanna fix it in a different way,
 > > > > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
 > > > > > > like in cgroup_put(). It feels more reliable to me.
 > > > > > >
 > > > > >
 > > > > > Yeah I also have this idea in my mind.
 > > > >
 > > > > I wonder if the following patch will fix the issue?
 > > >
 > > > Interesting, AFAIU, this refcnt is for bpf programs attached
 > > > to the cgroup. By this suggestion, do you mean the root
 > > > cgroup does not need to refcnt the bpf programs attached
 > > > to it? This seems odd, as I don't see how root is different
 > > > from others in terms of bpf programs which can be attached
 > > > and detached in the same way.
 > > >
 > > > I certainly understand the root cgroup is never gone, but this
 > > > does not mean the bpf programs attached to it too.
 > > >
 > > > What am I missing?
 > >
 > > It's different because the root cgroup can't be deleted.
 > >
 > > All this reference counting is required to automatically detach bpf programs
 > > from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
 > > because a cgroup can be in dying state for a long time being pinned by a
 > > pagecache page, for example. Only a user can detach a bpf program from
 > > an existing cgroup.
 >
 > Yeah, but users can still detach the bpf programs from root cgroup.
 > IIUC, after detaching, the pointer in the bpf array will be empty_prog_array
 > which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
 > deref it without checking NULL (as check_non_null == false).
 >
 > This matches the 0000000000000010 pointer seen in the bug reports,
 > the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
 > So looks like we have to add a NULL check there regardless of refcnt.
 >
 > Also, I am not sure whether your suggested patch makes a difference
 > for percpu refcnt, as percpu_ref_put() will never call ->release() until
 > percpu_ref_kill(), which is never called on root cgroup?

 > Hm, true. But it means that the problem is not with the root cgroup's bpf?

 > How easy is it to reproduce the problem? Is it possible to bisect the
 > problematic commit?

 >Thanks!

The tester found the following information during the test.

The dmesg information is as follows (kernel v5.4); I don't know if it
helps with this question.


root@intel-x86-64:~# cgroup: cgroup: disabling cgroup2 socket matching 
due to net_prio or net_cls activation
IPv6: ADDRCONF(NETDEV_CHANGE): veth4c31d8d2: link becomes ready
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered disabled state
device veth4c31d8d2 entered promiscuous mode
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered forwarding state
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev cni0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv6: ADDRCONF(NETDEV_CHANGE): vethb556dc7b: link becomes ready
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered disabled state
device vethb556dc7b entered promiscuous mode
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered forwarding state
IPv4: martian source 10.244.1.3 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
-----------[ cut here ]-----------
percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161 
percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Modules linked in: ipt_REJECT nf_reject_ipv4 vxlan ip6_udp_tunnel 
udp_tunnel xt_statistic xt_nat xt_tcpudp iptable_mangle xt_comment 
xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user 
iptable_nat xt_addrtype iptable_filter ip_tables xt_conntrack x_tables 
br_netfilter bridge stp llc bnep iTCO_wdt iTCO_vendor_support watchdog 
intel_powerclamp gpio_ich mgag200 drm_vram_helper drm_ttm_helper ttm 
i2c_i801 coretemp lpc_ich acpi_cpufreq sch_fq_codel openvswitch nsh 
nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G I 5.7.0-yoctodev-standard #1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Code: 80 3d b1 42 3b 01 00 0f 85 56 ff ff ff 49 8b 54 24 d8 48 c7 c7 68 
57 1d a9 c6 05 98 42 3b 01 01 49 8b 74 24 e8 e8 ea 14 aa ff <0f> 0b e9 
32 ff ff ff 0f 0b eb 97 cc cc cc cc cc cc cc cc cc cc cc
RSP: 0018:ffff996183268e90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 7ffffffffffffff3 RCX: 0000000000000000
RDX: 0000000000000102 RSI: ffffffffa9794aa7 RDI: 00000000ffffffff
RBP: ffff996183268ea8 R08: ffffffffa9794a60 R09: 0000000000000047
R10: 0000000080000001 R11: ffffffffa9794a8c R12: ffff95855904fef0
R13: 000023dc1ba33080 R14: ffff996183268ee0 R15: ffff95855904fef0
FS: 0000000000000000(0000) GS:ffff958563c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f62622f8658 CR3: 000000044fc0a000 CR4: 00000000000006e0
Call Trace:
<IRQ>
rcu_core+0x227/0x870
? timerqueue_add+0x68/0xa0
rcu_core_si+0xe/0x10
__do_softirq+0x102/0x358
? tick_program_event+0x4d/0x90
irq_exit+0xa0/0x110
smp_apic_timer_interrupt+0xa1/0x1b0
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:cpuidle_enter_state+0xc0/0x3c0
Code: 85 c0 0f 8f 28 02 00 00 31 ff e8 5b 1a 5f ff 45 84 ff 74 12 9c 58 
f6 c4 02 0f 85 d0 02 00 00 31 ff e8 34 ba 65 ff fb 45 85 e4 <0f> 88 cc 
00 00 00 49 63 cc 4c 2b 75 d0 48 6b c1 68 48 6b d1 38 48
RSP: 0018:ffff9961831c3e48 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: ffff958563c00000 RBX: ffffffffa9515e60 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000003286e833 RDI: 0000000000000000
RBP: ffff9961831c3e88 R08: 0000000000000002 R09: 0000000000000018
R10: 0000000000000364 R11: ffff958563c2a284 R12: 0000000000000003
R13: ffffb9617fc3fd00 R14: 00000194af870de5 R15: 0000000000000000
? cpuidle_enter_state+0xa5/0x3c0
cpuidle_enter+0x2e/0x40
call_cpuidle+0x23/0x40
do_idle+0x1c6/0x240
cpu_startup_entry+0x20/0x30
start_secondary+0x15b/0x190
secondary_startup_64+0xb6/0xc0
--[ end trace 805031dac04b28f5 ]--


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-23  8:45                         ` Zhang,Qiang
@ 2020-06-23 17:56                           ` Cong Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Cong Wang @ 2020-06-23 17:56 UTC (permalink / raw)
  To: Zhang,Qiang
  Cc: Roman Gushchin, Cong Wang, Peter Geis, Li Zefan,
	Cameron Berkenpas, Daniël Sonck, Lu Fengqi,
	Linux Kernel Network Developers, Tejun Heo

On Tue, Jun 23, 2020 at 1:45 AM Zhang,Qiang <qiang.zhang@windriver.com> wrote:
>
> There are some messages from kernel v5.4; I don't know if they will help.
>
> dmesg:
>
> cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or
> net_cls activation
...
> -----------[ cut here ]-----------
> percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
> WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161
> percpu_ref_switch_to_atomic_rcu+0x12a/0x140

Yes, this proves we have the refcnt bug which my patch tries to fix.
The negative refcnt is exactly a consequence of the bug: without
my patch we put a refcnt that was never taken in the first place.

If you can reproduce it, please test my patch:
https://patchwork.ozlabs.org/project/netdev/patch/20200616180352.18602-1-xiyou.wangcong@gmail.com/

But, so far I still don't have a good explanation for the NULL
pointer deref. I think that one is an older bug, and we need to check
for NULL even after we fix the refcnt bug, but I do not know why it
was only exposed recently by Zefan's patch. I am still trying to find
an explanation.

Thanks!


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-20  3:31                 ` Cong Wang
  2020-06-20  7:52                   ` Zefan Li
@ 2020-06-23 22:21                   ` Roman Gushchin
  2020-06-26  5:23                     ` Cameron Berkenpas
  1 sibling, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-06-23 22:21 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Fri, Jun 19, 2020 at 08:31:14PM -0700, Cong Wang wrote:
> On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
> >
> > 在 2020/6/20 8:45, Zefan Li 写道:
> > > On 2020/6/20 3:51, Cong Wang wrote:
> > >> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
> > >>>
> > >>> On 2020/6/19 5:09, Cong Wang wrote:
> > >>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> > >>>>>
> > >>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > >>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > >>>>>>>
> > >>>>>>> Cc: Roman Gushchin <guro@fb.com>
> > >>>>>>>
> > >>>>>>> Thanks for fixing this.
> > >>>>>>>
> > >>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> > >>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > >>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> > >>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> > >>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> > >>>>>>>> even when cgroup_sk_alloc is disabled.
> > >>>>>>>>
> > >>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> > >>>>>>>> would terminate this function if called there. And for sk_alloc()
> > >>>>>>>> skcd->val is always zero. So it's safe to factor out the code
> > >>>>>>>> to make it more readable.
> > >>>>>>>>
> > >>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > >>>>>>>
> > >>>>>>> but I don't think the bug was introduced by this commit, because there
> > >>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > >>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> > >>>>>>> classid in cgroupfs. This commit just made it much easier to happen
> > >>>>>>> with systemd involved.
> > >>>>>>>
> > >>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > >>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> > >>>>>>
> > >>>>>> Good point.
> > >>>>>>
> > >>>>>> I take a deeper look, it looks like commit d979a39d7242e06
> > >>>>>> is the one to blame, because it is the first commit that began to
> > >>>>>> hold cgroup refcnt in cgroup_sk_alloc().
> > >>>>>
> > >>>>> I agree, it seems that the issue is not related to bpf and probably
> > >>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > >>>>> seems closer to the origin.
> > >>>>
> > >>>> Yeah, I will update the Fixes tag and send V2.
> > >>>>
> > >>>
> > >>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> > >>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> > >>> but this is fine, because when the socket is to be freed:
> > >>>
> > >>>  sk_prot_free()
> > >>>    cgroup_sk_free()
> > >>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> > >>>
> > >>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> > >>
> > >> But skcd->val can be a pointer to a non-root cgroup:
> > >
> > > It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> > > when cgroup_sk_alloc is disabled.
> > >
> >
> > And please read those recent bug reports, they all happened when bpf cgroup was in use,
> > and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
> 
> I am totally aware of this. My concern is whether cgroup
> has the same refcnt bug as it always pairs with the bpf refcnt.
> 
> But, after a second look, the non-root cgroup refcnt is immediately
> overwritten by sock_update_classid() or sock_update_netprioidx(),
> which effectively turns into a root cgroup again. :-/
> 
> (It seems we leak a refcnt here, but this is not related to my patch).

Yeah, I looked over this code, and I have the same suspicion.
Especially in sk_alloc(), where cgroup_sk_alloc() is followed by
sock_update_classid() and sock_update_netprioidx().

I also think your original patch is good, but there are probably
some other problems which it doesn't fix.

I looked over cgroup bpf code again, and the only difference with cgroup
refcounting I see (behind the root cgroup, which is a non-issue) is
here:

void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
{
	...
	while (true) {
		struct css_set *cset;

		cset = task_css_set(current);
		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
			  ^^^^^^^^^^^^^^^
			skcd->val = (unsigned long)cset->dfl_cgrp;
			cgroup_bpf_get(cset->dfl_cgrp);
			^^^^^^^^^^^^^^
			break;
			...

So, in theory, cgroup_bpf_get() can be called here after cgroup_bpf_release().
We might wanna introduce something like cgroup_bpf_tryget_live().
Idk if it can happen in reality, because it would require opening a new socket
in a deleted cgroup (without any other associated sockets).

Other than that I don't see any differences between cgroup and cgroup bpf
reference counting.

Thanks!

PS I'll be completely offline till the end of the week. I'll answer all
e-mails on Monday (Jun 29th).


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-23 22:21                   ` Roman Gushchin
@ 2020-06-26  5:23                     ` Cameron Berkenpas
  2020-06-26 17:58                       ` Cong Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Cameron Berkenpas @ 2020-06-26  5:23 UTC (permalink / raw)
  To: Roman Gushchin, Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Peter Geis, Lu Fengqi,
	Daniël Sonck, Daniel Borkmann, Tejun Heo

Hello,

Somewhere along the way I got the impression that it generally takes 
those affected hours before their systems lock up. I'm (generally) able 
to reproduce this issue much faster than that. Regardless, I can help test.

Are there any patches that need testing or is this all still pending 
discussion around the best way to resolve the issue?

Thanks!

On 6/23/20 3:21 PM, Roman Gushchin wrote:
> On Fri, Jun 19, 2020 at 08:31:14PM -0700, Cong Wang wrote:
>> On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
>>> 在 2020/6/20 8:45, Zefan Li 写道:
>>>> On 2020/6/20 3:51, Cong Wang wrote:
>>>>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>> On 2020/6/19 5:09, Cong Wang wrote:
>>>>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>>>>>
>>>>>>>>>> Thanks for fixing this.
>>>>>>>>>>
>>>>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>>>>>
>>>>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>>>>>> to make it more readable.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>>>>>> with systemd involved.
>>>>>>>>>>
>>>>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>>>>> Good point.
>>>>>>>>>
>>>>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>>>>>> is the one to blame, because it is the first commit that began to
>>>>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>>>>> I agree, it seems that the issue is not related to bpf and probably
>>>>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>>>>>> seems closer to the origin.
>>>>>>> Yeah, I will update the Fixes tag and send V2.
>>>>>>>
>>>>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>>>>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>>>>>> but this is fine, because when the socket is to be freed:
>>>>>>
>>>>>>   sk_prot_free()
>>>>>>     cgroup_sk_free()
>>>>>>       cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>>>>>
>>>>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>>>> But skcd->val can be a pointer to a non-root cgroup:
>>>> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
>>>> when cgroup_sk_alloc is disabled.
>>>>
>>> And please read those recent bug reports, they all happened when bpf cgroup was in use,
>>> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
>> I am totally aware of this. My concern is whether cgroup
>> has the same refcnt bug as it always pairs with the bpf refcnt.
>>
>> But, after a second look, the non-root cgroup refcnt is immediately
>> overwritten by sock_update_classid() or sock_update_netprioidx(),
>> which effectively turns into a root cgroup again. :-/
>>
>> (It seems we leak a refcnt here, but this is not related to my patch).
> Yeah, I looked over this code, and I have the same suspicion.
> Especially in sk_alloc(), where cgroup_sk_alloc() is followed by
> sock_update_classid() and sock_update_netprioidx().
>
> I also think your original patch is good, but there are probably
> some other problems which it doesn't fix.
>
> I looked over cgroup bpf code again, and the only difference with cgroup
> refcounting I see (behind the root cgroup, which is a non-issue) is
> here:
>
> void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
> {
> 	...
> 	while (true) {
> 		struct css_set *cset;
>
> 		cset = task_css_set(current);
> 		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
> 			  ^^^^^^^^^^^^^^^
> 			skcd->val = (unsigned long)cset->dfl_cgrp;
> 			cgroup_bpf_get(cset->dfl_cgrp);
> 			^^^^^^^^^^^^^^
> 			break;
> 			...
>
> So, in theory, cgroup_bpf_get() can be called here after cgroup_bpf_release().
> We might wanna introduce something like cgroup_bpf_tryget_live().
> Idk if it can happen in reality, because it would require opening a new socket
> in a deleted cgroup (without any other associated sockets).
>
> Other than that I don't see any differences between cgroup and cgroup bpf
> reference counting.
>
> Thanks!
>
> PS I'll be completely offline till the end of the week. I'll answer all
> e-mails on Monday (Jun 29th).



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-26  5:23                     ` Cameron Berkenpas
@ 2020-06-26 17:58                       ` Cong Wang
  2020-06-26 22:03                         ` Cameron Berkenpas
  2020-06-27 23:41                         ` Roman Gushchin
  0 siblings, 2 replies; 41+ messages in thread
From: Cong Wang @ 2020-06-26 17:58 UTC (permalink / raw)
  To: Cameron Berkenpas
  Cc: Roman Gushchin, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 748 bytes --]

On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>
> Hello,
>
> Somewhere along the way I got the impression that it generally takes
> those affected hours before their systems lock up. I'm (generally) able
> to reproduce this issue much faster than that. Regardless, I can help test.
>
> Are there any patches that need testing or is this all still pending
> discussion around the  best way to resolve the issue?

Yes. I came up with a (hopefully) much better patch, attached.
Can you help test it? You need to unapply the previous patch before
applying this one.

(Just in case of any confusion: I still believe we should check NULL on
top of this refcnt fix. But it should be a separate patch.)

Thank you!

[-- Attachment #2: cgroup-bpf-ref.patch --]
[-- Type: text/x-patch, Size: 4774 bytes --]

commit 259150604c0b77c717fdaab057da5722e2dfd922
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Sat Jun 13 12:34:40 2020 -0700

    cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
    
    When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
    copied, so the cgroup refcnt must be taken too. And, unlike the
    sk_alloc() path, sock_update_netprioidx() is not called here.
    Therefore, it is safe and necessary to grab the cgroup refcnt
    even when cgroup_sk_alloc is disabled.
    
    sk_clone_lock() is in BH context anyway, the in_interrupt()
    would terminate this function if called there. And for sk_alloc()
    skcd->val is always zero. So it's safe to factor out the code
    to make it more readable.
    
    Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
    Reported-by: Peter Geis <pgwipeout@gmail.com>
    Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
    Reported-by: Daniël Sonck <dsonck92@gmail.com>
    Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Zefan Li <lizefan@huawei.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 52661155f85f..4f1cd0edc57d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -790,7 +790,8 @@ struct sock_cgroup_data {
 	union {
 #ifdef __LITTLE_ENDIAN
 		struct {
-			u8	is_data;
+			u8	is_data : 1;
+			u8	no_refcnt : 1;
 			u8	padding;
 			u16	prioidx;
 			u32	classid;
@@ -800,7 +801,8 @@ struct sock_cgroup_data {
 			u32	classid;
 			u16	prioidx;
 			u8	padding;
-			u8	is_data;
+			u8	no_refcnt : 1;
+			u8	is_data : 1;
 		} __packed;
 #endif
 		u64		val;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..618838c48313 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
 
 void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
@@ -835,7 +836,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 	 */
 	v = READ_ONCE(skcd->val);
 
-	if (v & 1)
+	if (v & 3)
 		return &cgrp_dfl_root.cgrp;
 
 	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
@@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 #else	/* CONFIG_CGROUP_DATA */
 
 static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
 static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif	/* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..dd247747ec14 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6439,18 +6439,8 @@ void cgroup_sk_alloc_disable(void)
 
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-	if (cgroup_sk_alloc_disabled)
-		return;
-
-	/* Socket clone path */
-	if (skcd->val) {
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	if (cgroup_sk_alloc_disabled) {
+		skcd->no_refcnt = 1;
 		return;
 	}
 
@@ -6475,10 +6465,27 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	rcu_read_unlock();
 }
 
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+	if (skcd->val) {
+		if (skcd->no_refcnt)
+			return;
+		/*
+		 * We might be cloning a socket which is left in an empty
+		 * cgroup and the cgroup might have already been rmdir'd.
+		 * Don't use cgroup_get_live().
+		 */
+		cgroup_get(sock_cgroup_ptr(skcd));
+		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	}
+}
+
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
 
+	if (skcd->no_refcnt)
+		return;
 	cgroup_bpf_put(cgrp);
 	cgroup_put(cgrp);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index d832c650287c..2e5b7870e5d3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1926,7 +1926,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		/* sk->sk_memcg will be populated at accept() time */
 		newsk->sk_memcg = NULL;
 
-		cgroup_sk_alloc(&newsk->sk_cgrp_data);
+		cgroup_sk_clone(&newsk->sk_cgrp_data);
 
 		rcu_read_lock();
 		filter = rcu_dereference(sk->sk_filter);


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-26 17:58                       ` Cong Wang
@ 2020-06-26 22:03                         ` Cameron Berkenpas
  2020-06-27 22:59                           ` Cameron Berkenpas
  2020-06-27 23:41                         ` Roman Gushchin
  1 sibling, 1 reply; 41+ messages in thread
From: Cameron Berkenpas @ 2020-06-26 22:03 UTC (permalink / raw)
  To: Cong Wang
  Cc: Roman Gushchin, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

Box has been up for 25 minutes without issue. The patch probably
works, but I can confirm further by tomorrow.

On 6/26/2020 10:58 AM, Cong Wang wrote:
> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>> Hello,
>>
>> Somewhere along the way I got the impression that it generally takes
>> those affected hours before their systems lock up. I'm (generally) able
>> to reproduce this issue much faster than that. Regardless, I can help test.
>>
>> Are there any patches that need testing or is this all still pending
>> discussion around the  best way to resolve the issue?
> Yes. I come up with a (hopefully) much better patch in the attachment.
> Can you help to test it? You need to unapply the previous patch before
> applying this one.
>
> (Just in case of any confusion: I still believe we should check NULL on
> top of this refcnt fix. But it should be a separate patch.)
>
> Thank you!



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-26 22:03                         ` Cameron Berkenpas
@ 2020-06-27 22:59                           ` Cameron Berkenpas
  2020-06-30 22:16                             ` Cong Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Cameron Berkenpas @ 2020-06-27 22:59 UTC (permalink / raw)
  To: Cong Wang
  Cc: Roman Gushchin, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

The box has been up without issue for over 25 hours now. The patch seems 
solid.

On 6/26/20 3:03 PM, Cameron Berkenpas wrote:
> Box has been up for 25 minutes without issue. Probably the patch 
> works, but I can further confirm by tomorrow.
>
> On 6/26/2020 10:58 AM, Cong Wang wrote:
>> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> 
>> wrote:
>>> Hello,
>>>
>>> Somewhere along the way I got the impression that it generally takes
>>> those affected hours before their systems lock up. I'm (generally) able
>>> to reproduce this issue much faster than that. Regardless, I can 
>>> help test.
>>>
>>> Are there any patches that need testing or is this all still pending
>>> discussion around the  best way to resolve the issue?
>> Yes. I come up with a (hopefully) much better patch in the attachment.
>> Can you help to test it? You need to unapply the previous patch before
>> applying this one.
>>
>> (Just in case of any confusion: I still believe we should check NULL on
>> top of this refcnt fix. But it should be a separate patch.)
>>
>> Thank you!
>



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-26 17:58                       ` Cong Wang
  2020-06-26 22:03                         ` Cameron Berkenpas
@ 2020-06-27 23:41                         ` Roman Gushchin
  2020-06-30 22:22                           ` Cong Wang
  1 sibling, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-06-27 23:41 UTC (permalink / raw)
  To: Cong Wang
  Cc: Cameron Berkenpas, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
> >
> > Hello,
> >
> > Somewhere along the way I got the impression that it generally takes
> > those affected hours before their systems lock up. I'm (generally) able
> > to reproduce this issue much faster than that. Regardless, I can help test.
> >
> > Are there any patches that need testing or is this all still pending
> > discussion around the  best way to resolve the issue?
> 
> Yes. I come up with a (hopefully) much better patch in the attachment.
> Can you help to test it? You need to unapply the previous patch before
> applying this one.
> 
> (Just in case of any confusion: I still believe we should check NULL on
> top of this refcnt fix. But it should be a separate patch.)
> 
> Thank you!

Not opposing the patch, but the Fixes tag is still confusing me.
Do we have an explanation for what's wrong with 4bfc0bb2c60e?

It looks like we have cgroup_bpf_get()/put() exactly where we have
cgroup_get()/put(), so it would be nice to understand what's different
if the problem is bpf-related.

Thanks!

> commit 259150604c0b77c717fdaab057da5722e2dfd922
> Author: Cong Wang <xiyou.wangcong@gmail.com>
> Date:   Sat Jun 13 12:34:40 2020 -0700
> 
>     cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
>     
>     When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>     copied, so the cgroup refcnt must be taken too. And, unlike the
>     sk_alloc() path, sock_update_netprioidx() is not called here.
>     Therefore, it is safe and necessary to grab the cgroup refcnt
>     even when cgroup_sk_alloc is disabled.
>     
>     sk_clone_lock() is in BH context anyway, the in_interrupt()
>     would terminate this function if called there. And for sk_alloc()
>     skcd->val is always zero. So it's safe to factor out the code
>     to make it more readable.
>     
>     Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
>     Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
>     Reported-by: Peter Geis <pgwipeout@gmail.com>
>     Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
>     Reported-by: Daniël Sonck <dsonck92@gmail.com>
>     Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
>     Cc: Daniel Borkmann <daniel@iogearbox.net>
>     Cc: Zefan Li <lizefan@huawei.com>
>     Cc: Tejun Heo <tj@kernel.org>
>     Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 52661155f85f..4f1cd0edc57d 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -790,7 +790,8 @@ struct sock_cgroup_data {
>  	union {
>  #ifdef __LITTLE_ENDIAN
>  		struct {
> -			u8	is_data;
> +			u8	is_data : 1;
> +			u8	no_refcnt : 1;
>  			u8	padding;
>  			u16	prioidx;
>  			u32	classid;
> @@ -800,7 +801,8 @@ struct sock_cgroup_data {
>  			u32	classid;
>  			u16	prioidx;
>  			u8	padding;
> -			u8	is_data;
> +			u8	no_refcnt : 1;
> +			u8	is_data : 1;
>  		} __packed;
>  #endif
>  		u64		val;
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4598e4da6b1b..618838c48313 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
>  
>  void cgroup_sk_alloc_disable(void);
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
> +void cgroup_sk_clone(struct sock_cgroup_data *skcd);
>  void cgroup_sk_free(struct sock_cgroup_data *skcd);
>  
>  static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
> @@ -835,7 +836,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>  	 */
>  	v = READ_ONCE(skcd->val);
>  
> -	if (v & 1)
> +	if (v & 3)
>  		return &cgrp_dfl_root.cgrp;
>  
>  	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> @@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>  #else	/* CONFIG_CGROUP_DATA */
>  
>  static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
> +static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
>  static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
>  
>  #endif	/* CONFIG_CGROUP_DATA */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1ea181a58465..dd247747ec14 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6439,18 +6439,8 @@ void cgroup_sk_alloc_disable(void)
>  
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  {
> -	if (cgroup_sk_alloc_disabled)
> -		return;
> -
> -	/* Socket clone path */
> -	if (skcd->val) {
> -		/*
> -		 * We might be cloning a socket which is left in an empty
> -		 * cgroup and the cgroup might have already been rmdir'd.
> -		 * Don't use cgroup_get_live().
> -		 */
> -		cgroup_get(sock_cgroup_ptr(skcd));
> -		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> +	if (cgroup_sk_alloc_disabled) {
> +		skcd->no_refcnt = 1;
>  		return;
>  	}
>  
> @@ -6475,10 +6465,27 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  	rcu_read_unlock();
>  }
>  
> +void cgroup_sk_clone(struct sock_cgroup_data *skcd)
> +{
> +	if (skcd->val) {
> +		if (skcd->no_refcnt)
> +			return;
> +		/*
> +		 * We might be cloning a socket which is left in an empty
> +		 * cgroup and the cgroup might have already been rmdir'd.
> +		 * Don't use cgroup_get_live().
> +		 */
> +		cgroup_get(sock_cgroup_ptr(skcd));
> +		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> +	}
> +}
> +
>  void cgroup_sk_free(struct sock_cgroup_data *skcd)
>  {
>  	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
>  
> +	if (skcd->no_refcnt)
> +		return;
>  	cgroup_bpf_put(cgrp);
>  	cgroup_put(cgrp);
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index d832c650287c..2e5b7870e5d3 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1926,7 +1926,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		/* sk->sk_memcg will be populated at accept() time */
>  		newsk->sk_memcg = NULL;
>  
> -		cgroup_sk_alloc(&newsk->sk_cgrp_data);
> +		cgroup_sk_clone(&newsk->sk_cgrp_data);
>  
>  		rcu_read_lock();
>  		filter = rcu_dereference(sk->sk_filter);



* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-27 22:59                           ` Cameron Berkenpas
@ 2020-06-30 22:16                             ` Cong Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Cong Wang @ 2020-06-30 22:16 UTC (permalink / raw)
  To: Cameron Berkenpas
  Cc: Roman Gushchin, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Sat, Jun 27, 2020 at 3:59 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>
> The box has been up without issue for over 25 hours now. The patch seems
> solid.

That's great! Thanks for testing!


* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-27 23:41                         ` Roman Gushchin
@ 2020-06-30 22:22                           ` Cong Wang
  2020-06-30 22:48                             ` Roman Gushchin
  0 siblings, 1 reply; 41+ messages in thread
From: Cong Wang @ 2020-06-30 22:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cameron Berkenpas, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> > On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
> > >
> > > Hello,
> > >
> > > Somewhere along the way I got the impression that it generally takes
> > > those affected hours before their systems lock up. I'm (generally) able
> > > to reproduce this issue much faster than that. Regardless, I can help test.
> > >
> > > Are there any patches that need testing or is this all still pending
> > > discussion around the  best way to resolve the issue?
> >
> > Yes. I come up with a (hopefully) much better patch in the attachment.
> > Can you help to test it? You need to unapply the previous patch before
> > applying this one.
> >
> > (Just in case of any confusion: I still believe we should check NULL on
> > top of this refcnt fix. But it should be a separate patch.)
> >
> > Thank you!
>
> Not opposing the patch, but the Fixes tag is still confusing me.
> Do we have an explanation for what's wrong with 4bfc0bb2c60e?
>
> It looks like we have cgroup_bpf_get()/put() exactly where we have
> cgroup_get()/put(), so it would be nice to understand what's different
> if the problem is bpf-related.

Hmm, in our previous discussion I think it was Zefan who believed
the cgroup refcnt is fine and the bug is only in the cgroup bpf refcnt.

Although I agree the cgroup refcnt is buggy too, it may not necessarily
cause any real problem, otherwise we would have received bug reports
much earlier than just recently, right?

If the Fixes tag is confusing, I can certainly remove it, but this also
means the patch will not be backported to stable. I am fine either
way, this crash is only reported after Zefan's recent change anyway.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-30 22:22                           ` Cong Wang
@ 2020-06-30 22:48                             ` Roman Gushchin
  2020-07-01  1:18                               ` Zefan Li
  2020-07-02  4:48                               ` Cong Wang
  0 siblings, 2 replies; 41+ messages in thread
From: Roman Gushchin @ 2020-06-30 22:48 UTC (permalink / raw)
  To: Cong Wang
  Cc: Cameron Berkenpas, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Tue, Jun 30, 2020 at 03:22:34PM -0700, Cong Wang wrote:
> On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> > > On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Somewhere along the way I got the impression that it generally takes
> > > > those affected hours before their systems lock up. I'm (generally) able
> > > > to reproduce this issue much faster than that. Regardless, I can help test.
> > > >
> > > > Are there any patches that need testing or is this all still pending
> > > > discussion around the  best way to resolve the issue?
> > >
> > > Yes. I come up with a (hopefully) much better patch in the attachment.
> > > Can you help to test it? You need to unapply the previous patch before
> > > applying this one.
> > >
> > > (Just in case of any confusion: I still believe we should check NULL on
> > > top of this refcnt fix. But it should be a separate patch.)
> > >
> > > Thank you!
> >
> > Not opposing the patch, but the Fixes tag is still confusing me.
> > Do we have an explanation for what's wrong with 4bfc0bb2c60e?
> >
> > It looks like we have cgroup_bpf_get()/put() exactly where we have
> > cgroup_get()/put(), so it would be nice to understand what's different
> > if the problem is bpf-related.
> 
> Hmm, I think it is Zefan who believes cgroup refcnt is fine, the bug
> is just in cgroup bpf refcnt, in our previous discussion.
> 
> Although I agree cgroup refcnt is buggy too, it may not necessarily
> cause any real problem, otherwise we would receive bug report
> much earlier than just recently, right?
> 
> If the Fixes tag is confusing, I can certainly remove it, but this also
> means the patch will not be backported to stable. I am fine either
> way, this crash is only reported after Zefan's recent change anyway.

Well, I'm not trying to protect my commit; I just don't understand
the whole picture, and what I see doesn't make complete sense to me.

I understand a problem which can be described as copying the cgroup pointer
on socket cloning without bumping the reference counter.
It seems that this problem is not caused by bpf changes, so if we're adding
a Fixes tag, it must point at an earlier commit. Most likely, it was there from
scratch, i.e. from bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup").
Do we know why Zefan's change made it reproducible?

Btw if we want to backport the fix but can't blame a specific commit,
we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".

Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-30 22:48                             ` Roman Gushchin
@ 2020-07-01  1:18                               ` Zefan Li
  2020-07-02  4:48                               ` Cong Wang
  1 sibling, 0 replies; 41+ messages in thread
From: Zefan Li @ 2020-07-01  1:18 UTC (permalink / raw)
  To: Roman Gushchin, Cong Wang
  Cc: Cameron Berkenpas, Linux Kernel Network Developers, Peter Geis,
	Lu Fengqi, Daniël Sonck, Daniel Borkmann, Tejun Heo

On 2020/7/1 6:48, Roman Gushchin wrote:
> On Tue, Jun 30, 2020 at 03:22:34PM -0700, Cong Wang wrote:
>> On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin <guro@fb.com> wrote:
>>>
>>> On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
>>>> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Somewhere along the way I got the impression that it generally takes
>>>>> those affected hours before their systems lock up. I'm (generally) able
>>>>> to reproduce this issue much faster than that. Regardless, I can help test.
>>>>>
>>>>> Are there any patches that need testing or is this all still pending
>>>>> discussion around the  best way to resolve the issue?
>>>>
>>>> Yes. I come up with a (hopefully) much better patch in the attachment.
>>>> Can you help to test it? You need to unapply the previous patch before
>>>> applying this one.
>>>>
>>>> (Just in case of any confusion: I still believe we should check NULL on
>>>> top of this refcnt fix. But it should be a separate patch.)
>>>>
>>>> Thank you!
>>>
>>> Not opposing the patch, but the Fixes tag is still confusing me.
>>> Do we have an explanation for what's wrong with 4bfc0bb2c60e?
>>>
>>> It looks like we have cgroup_bpf_get()/put() exactly where we have
>>> cgroup_get()/put(), so it would be nice to understand what's different
>>> if the problem is bpf-related.
>>
>> Hmm, I think it is Zefan who believes cgroup refcnt is fine, the bug
>> is just in cgroup bpf refcnt, in our previous discussion.
>>
>> Although I agree cgroup refcnt is buggy too, it may not necessarily
>> cause any real problem, otherwise we would receive bug report
>> much earlier than just recently, right?
>>
>> If the Fixes tag is confusing, I can certainly remove it, but this also
>> means the patch will not be backported to stable. I am fine either
>> way, this crash is only reported after Zefan's recent change anyway.
> 
> Well, I'm not trying to protect my commit, I just don't understand
> the whole picture and what I see doesn't make complete sense to me.
> 
> I understand a problem which can be described as copying the cgroup pointer
> on cgroup cloning without bumping the reference counter.
> It seems that this problem is not caused by bpf changes, so if we're adding
> a Fixes tag, it must point at an earlier commit. Most likely, it was there from
> scratch, i.e. from bd1060ad671 ("sock, cgroup: add sock->sk_cgroup").
> Do we know why Zefan's change made it reproducible?
> 

Actually I've explained it in my earlier reply.

https://www.spinics.net/lists/netdev/msg661004.html

With my change, cgroup_sk_alloc will be disabled when we create a cgroup in
the netprio cgroup subsystem, and systemd keeps creating cgroups at high
frequency.

Before this change, it would be disabled only when someone wrote to ifpriomap
or classid in cgroupfs, and they are very unlikely to do that if they are
using bpf in the default cgroup tree, I think.

So the bug has been there for some time.

> Btw if we want to backport the problem but can't blame a specific commit,
> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-06-30 22:48                             ` Roman Gushchin
  2020-07-01  1:18                               ` Zefan Li
@ 2020-07-02  4:48                               ` Cong Wang
  2020-07-02  8:12                                 ` Thomas Lamprecht
  2020-07-02 16:02                                 ` Roman Gushchin
  1 sibling, 2 replies; 41+ messages in thread
From: Cong Wang @ 2020-07-02  4:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cameron Berkenpas, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
>
> Btw if we want to backport the problem but can't blame a specific commit,
> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".

Sure, but if we don't know which commit is the right one to blame, then how
do we know which stable versions the patch should target? :)

I am open to all options here, including not backporting to stable at all.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-07-02  4:48                               ` Cong Wang
@ 2020-07-02  8:12                                 ` Thomas Lamprecht
  2020-07-02 16:02                                 ` Roman Gushchin
  1 sibling, 0 replies; 41+ messages in thread
From: Thomas Lamprecht @ 2020-07-02  8:12 UTC (permalink / raw)
  To: Cong Wang, Roman Gushchin
  Cc: Cameron Berkenpas, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On 02.07.20 06:48, Cong Wang wrote:
> On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
>>
>> Btw if we want to backport the problem but can't blame a specific commit,
>> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
> 
> Sure, but if we don't know which is the right commit to blame, then how
> do we know which stable version should the patch target? :)

We ran into a similar issue here after updating from the 5.4.41 to the
5.4.44 stable kernel. This patch addresses the issue: we have been running
stably at >17 hours uptime with this patch, whereas without it we normally
ran into issues at <6 hours uptime.

That update newly included commit 090e28b229af92dc5b ("netprio_cgroup: Fix
unlimited memory leak of v2 cgroups"), which this patch originally mentions in
its "Fixes" tag, whereas the other possible culprit mentioned, 4bfc0bb2c60e2f4c
("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"), was included
with 5.2 here and did *not* cause problems.

So, while the real culprit may be something else, a mix of the two, or even more
complex, the race is at least triggered far more frequently with the
090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
commit or, for the sake of completeness, possibly also with something else from
the v5.4.41..v5.4.44 commit range - I have not looked into that in detail yet.

> 
> I am open to all options here, including not backporting to stable at all.

As said, the stable-5.4.y tree benefits from having this patch, so there's
that.

Also, FWIW:
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-07-02  4:48                               ` Cong Wang
  2020-07-02  8:12                                 ` Thomas Lamprecht
@ 2020-07-02 16:02                                 ` Roman Gushchin
  2020-07-02 16:24                                   ` Peter Geis
  2020-07-03  1:17                                   ` Zefan Li
  1 sibling, 2 replies; 41+ messages in thread
From: Roman Gushchin @ 2020-07-02 16:02 UTC (permalink / raw)
  To: Cong Wang
  Cc: Cameron Berkenpas, Zefan Li, Linux Kernel Network Developers,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo

On Wed, Jul 01, 2020 at 09:48:48PM -0700, Cong Wang wrote:
> On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Btw if we want to backport the problem but can't blame a specific commit,
> > we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
> 
> Sure, but if we don't know which is the right commit to blame, then how
> do we know which stable version should the patch target? :)
> 
> I am open to all options here, including not backporting to stable at all.

It seems that the issue has been there since bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"),
so I'd go with that. Otherwise we can go with 5.4+, since as I understand it,
the bug was hard to reproduce before then.

Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-07-02 16:02                                 ` Roman Gushchin
@ 2020-07-02 16:24                                   ` Peter Geis
  2020-07-03  1:17                                   ` Zefan Li
  1 sibling, 0 replies; 41+ messages in thread
From: Peter Geis @ 2020-07-02 16:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cong Wang, Cameron Berkenpas, Zefan Li,
	Linux Kernel Network Developers, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo

On Thu, Jul 2, 2020 at 12:03 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Jul 01, 2020 at 09:48:48PM -0700, Cong Wang wrote:
> > On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Btw if we want to backport the problem but can't blame a specific commit,
> > > we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
> >
> > Sure, but if we don't know which is the right commit to blame, then how
> > do we know which stable version should the patch target? :)
> >
> > I am open to all options here, including not backporting to stable at all.
>
> It seems to be that the issue was there from bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"),
> so I'd go with it. Otherwise we can go with 5.4+, as I understand before that it was
> hard to reproduce it.
>
> Thanks!

Just wanted to let you know, this patch has been running for ~36 hours
without a crash.
So:
Tested-by: Peter Geis <pgwipeout@gmail.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  2020-07-02 16:02                                 ` Roman Gushchin
  2020-07-02 16:24                                   ` Peter Geis
@ 2020-07-03  1:17                                   ` Zefan Li
  1 sibling, 0 replies; 41+ messages in thread
From: Zefan Li @ 2020-07-03  1:17 UTC (permalink / raw)
  To: Roman Gushchin, Cong Wang
  Cc: Cameron Berkenpas, Linux Kernel Network Developers, Peter Geis,
	Lu Fengqi, Daniël Sonck, Daniel Borkmann, Tejun Heo

On 2020/7/3 0:02, Roman Gushchin wrote:
> On Wed, Jul 01, 2020 at 09:48:48PM -0700, Cong Wang wrote:
>> On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
>>>
>>> Btw if we want to backport the problem but can't blame a specific commit,
>>> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
>>
>> Sure, but if we don't know which is the right commit to blame, then how
>> do we know which stable version should the patch target? :)
>>
>> I am open to all options here, including not backporting to stable at all.
> 
> It seems to be that the issue was there from bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"),
> so I'd go with it. Otherwise we can go with 5.4+, as I understand before that it was
> hard to reproduce it.
> 

Actually I think it should be very easy to reproduce the bug.

Suppose the default cgroup and the netcls cgroup are mounted at /cgroup/default
and /cgroup/netcls respectively, and then:

1. mkdir /cgroup/default/sub1
2. mkdir /cgroup/default/sub2
3. attach some tasks to sub1/ and sub2/
4. attach a bpf program to sub1/ and sub2/  # takes the bpf refcnt for those cgroups
5. echo 1 > /cgroup/netcls/classid          # this disables cgroup_sk_alloc
6. kill all tasks in sub1/ and sub2/
7. rmdir sub1/ sub2/

The last step will drop the bpf refcnt of the default root cgroup instead of
that of sub1/ and sub2/, and should trigger the bug.

FYI I never use bpf, so I might be wrong.
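
For reference, the steps above can be written down as a script. This is a
dry-run sketch, not a command sequence from the thread: the `run` helper only
prints each command (the real steps need root and may crash a vulnerable
machine), and the bpftool invocation and pinned program path are placeholders:

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps. `run` only echoes each
# command; replace it with direct execution on a disposable test box.
run() { printf '+ %s\n' "$*"; }

# 1-2. create two sub-cgroups in the default hierarchy
run mkdir /cgroup/default/sub1
run mkdir /cgroup/default/sub2

# 3. attach some tasks
run sh -c 'echo $$ > /cgroup/default/sub1/cgroup.procs'
run sh -c 'echo $$ > /cgroup/default/sub2/cgroup.procs'

# 4. attach a bpf program, taking the bpf refcnt on those cgroups
#    (program path is a placeholder)
run bpftool cgroup attach /cgroup/default/sub1 egress pinned /sys/fs/bpf/prog
run bpftool cgroup attach /cgroup/default/sub2 egress pinned /sys/fs/bpf/prog

# 5. writing a netcls classid disables cgroup_sk_alloc
run sh -c 'echo 1 > /cgroup/netcls/classid'

# 6-7. kill the tasks, then remove the sub-cgroups; the rmdir drops the
#      bpf refcnt on the wrong (root) cgroup and should trigger the bug
run rmdir /cgroup/default/sub1 /cgroup/default/sub2
```

Dropping the `run` wrapper turns the sketch into the destructive repro.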

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2020-07-03  1:17 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-16 18:03 [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock() Cong Wang
2020-06-18  1:44 ` Zefan Li
2020-06-18 19:19   ` Cong Wang
2020-06-18 19:36     ` Roman Gushchin
2020-06-18 21:09       ` Cong Wang
2020-06-18 21:26         ` Roman Gushchin
2020-06-18 22:45           ` Peter Geis
2020-06-19  6:40         ` Zefan Li
2020-06-19 19:51           ` Cong Wang
2020-06-20  0:45             ` Zefan Li
2020-06-20  0:51               ` Zefan Li
2020-06-20  3:31                 ` Cong Wang
2020-06-20  7:52                   ` Zefan Li
2020-06-20 16:04                     ` Roman Gushchin
2020-06-23 22:21                   ` Roman Gushchin
2020-06-26  5:23                     ` Cameron Berkenpas
2020-06-26 17:58                       ` Cong Wang
2020-06-26 22:03                         ` Cameron Berkenpas
2020-06-27 22:59                           ` Cameron Berkenpas
2020-06-30 22:16                             ` Cong Wang
2020-06-27 23:41                         ` Roman Gushchin
2020-06-30 22:22                           ` Cong Wang
2020-06-30 22:48                             ` Roman Gushchin
2020-07-01  1:18                               ` Zefan Li
2020-07-02  4:48                               ` Cong Wang
2020-07-02  8:12                                 ` Thomas Lamprecht
2020-07-02 16:02                                 ` Roman Gushchin
2020-07-02 16:24                                   ` Peter Geis
2020-07-03  1:17                                   ` Zefan Li
2020-06-20  0:51           ` Roman Gushchin
2020-06-20  1:00             ` Zefan Li
2020-06-20  1:14               ` Roman Gushchin
2020-06-20  2:48                 ` Zefan Li
2020-06-20  3:00                 ` Cong Wang
2020-06-20 15:57                   ` Roman Gushchin
2020-06-22 18:14                     ` Cong Wang
2020-06-22 20:39                       ` Roman Gushchin
2020-06-23  8:45                         ` Zhang,Qiang
2020-06-23 17:56                           ` Cong Wang
2020-06-23  8:54                         ` Zhang,Qiang
2020-06-23  9:01                         ` Zhang,Qiang
