linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] netprio_cgroup: Fix unlimited memory leak of v2 cgroup
@ 2020-05-09  3:19 Zefan Li
  2020-05-09  3:32 ` [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups Zefan Li
  0 siblings, 1 reply; 7+ messages in thread
From: Zefan Li @ 2020-05-09  3:19 UTC (permalink / raw)
  To: Tejun Heo, David Miller
  Cc: yangyingliang, Kefeng Wang, huawei.libin, guofan5, linux-kernel,
	cgroups, Linux Kernel Network Developers

If systemd is configured to use hybrid mode which enables the use of
both cgroup v1 and v2, systemd will create new cgroup on both the default
root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
task to the two cgroups. If the task does some network thing then the v2
cgroup can never be freed after the session exited.

One of our machines ran into OOM due to this memory leak.

In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
and increments the cgroup refcnt, but then sock_update_netprioidx() thought
it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
cgroup refcnt will never be freed.

Currently we do the mode switch when someone writes to the ifpriomap cgroup
control file. The easiest fix is to also do the switch when a task is attached
to a new cgroup.

Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Tested-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
---
 net/core/netprio_cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b905747..2397866 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
 	struct task_struct *p;
 	struct cgroup_subsys_state *css;
 
+	cgroup_sk_alloc_disable();
+
 	cgroup_taskset_for_each(p, css, tset) {
 		void *v = (void *)(unsigned long)css->cgroup->id;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  2020-05-09  3:19 [PATCH] netprio_cgroup: Fix unlimited memory leak of v2 cgroup Zefan Li
@ 2020-05-09  3:32 ` Zefan Li
  2020-05-09  5:58   ` Jakub Kicinski
  2020-05-09 22:57   ` Tejun Heo
  0 siblings, 2 replies; 7+ messages in thread
From: Zefan Li @ 2020-05-09  3:32 UTC (permalink / raw)
  To: Tejun Heo, David Miller
  Cc: yangyingliang, Kefeng Wang, huawei.libin, guofan5, linux-kernel,
	cgroups, Linux Kernel Network Developers

If systemd is configured to use hybrid mode which enables the use of
both cgroup v1 and v2, systemd will create new cgroup on both the default
root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
task to the two cgroups. If the task does some network thing then the v2
cgroup can never be freed after the session exited.

One of our machines ran into OOM due to this memory leak.

In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
and increments the cgroup refcnt, but then sock_update_netprioidx() thought
it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
cgroup refcnt will never be freed.

Currently we do the mode switch when someone writes to the ifpriomap cgroup
control file. The easiest fix is to also do the switch when a task is attached
to a new cgroup.

Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Tested-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
---

forgot to rebase to the latest kernel.

---
 net/core/netprio_cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 8881dd9..9bd4cab 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -236,6 +236,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
 	struct task_struct *p;
 	struct cgroup_subsys_state *css;
 
+	cgroup_sk_alloc_disable();
+
 	cgroup_taskset_for_each(p, css, tset) {
 		void *v = (void *)(unsigned long)css->id;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  2020-05-09  3:32 ` [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups Zefan Li
@ 2020-05-09  5:58   ` Jakub Kicinski
  2020-05-10  4:02     ` Jakub Kicinski
  2020-05-09 22:57   ` Tejun Heo
  1 sibling, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2020-05-09  5:58 UTC (permalink / raw)
  To: Zefan Li
  Cc: Tejun Heo, David Miller, yangyingliang, Kefeng Wang,
	huawei.libin, guofan5, linux-kernel, cgroups,
	Linux Kernel Network Developers

On Sat, 9 May 2020 11:32:10 +0800 Zefan Li wrote:
> If systemd is configured to use hybrid mode which enables the use of
> both cgroup v1 and v2, systemd will create new cgroup on both the default
> root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
> task to the two cgroups. If the task does some network thing then the v2
> cgroup can never be freed after the session exited.
> 
> One of our machines ran into OOM due to this memory leak.
> 
> In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
> thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
> and increments the cgroup refcnt, but then sock_update_netprioidx() thought
> it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
> cgroup refcnt will never be freed.
> 
> Currently we do the mode switch when someone writes to the ifpriomap cgroup
> control file. The easiest fix is to also do the switch when a task is attached
> to a new cgroup.
> 
> Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")

                     ^ space missing here

> Reported-by: Yang Yingliang <yangyingliang@huawei.com>
> Tested-by: Yang Yingliang <yangyingliang@huawei.com>
> Signed-off-by: Zefan Li <lizefan@huawei.com>
> ---
> 
> forgot to rebase to the latest kernel.
> 
> ---
>  net/core/netprio_cgroup.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index 8881dd9..9bd4cab 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -236,6 +236,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
>  	struct task_struct *p;
>  	struct cgroup_subsys_state *css;
>  
> +	cgroup_sk_alloc_disable();
> +
>  	cgroup_taskset_for_each(p, css, tset) {
>  		void *v = (void *)(unsigned long)css->id;
>  


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  2020-05-09  3:32 ` [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups Zefan Li
  2020-05-09  5:58   ` Jakub Kicinski
@ 2020-05-09 22:57   ` Tejun Heo
  1 sibling, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2020-05-09 22:57 UTC (permalink / raw)
  To: Zefan Li
  Cc: David Miller, yangyingliang, Kefeng Wang, huawei.libin, guofan5,
	linux-kernel, cgroups, Linux Kernel Network Developers

On Sat, May 09, 2020 at 11:32:10AM +0800, Zefan Li wrote:
> If systemd is configured to use hybrid mode which enables the use of
> both cgroup v1 and v2, systemd will create new cgroup on both the default
> root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
> task to the two cgroups. If the task does some network thing then the v2
> cgroup can never be freed after the session exited.
> 
> One of our machines ran into OOM due to this memory leak.
> 
> In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
> thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
> and increments the cgroup refcnt, but then sock_update_netprioidx() thought
> it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
> cgroup refcnt will never be freed.
> 
> Currently we do the mode switch when someone writes to the ifpriomap cgroup
> control file. The easiest fix is to also do the switch when a task is attached
> to a new cgroup.
> 
> Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")
> Reported-by: Yang Yingliang <yangyingliang@huawei.com>
> Tested-by: Yang Yingliang <yangyingliang@huawei.com>
> Signed-off-by: Zefan Li <lizefan@huawei.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  2020-05-09  5:58   ` Jakub Kicinski
@ 2020-05-10  4:02     ` Jakub Kicinski
  2020-05-21 21:14       ` John Fastabend
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2020-05-10  4:02 UTC (permalink / raw)
  To: Zefan Li
  Cc: Tejun Heo, David Miller, yangyingliang, Kefeng Wang,
	huawei.libin, guofan5, linux-kernel, cgroups,
	Linux Kernel Network Developers

On Fri, 8 May 2020 22:58:29 -0700 Jakub Kicinski wrote:
> On Sat, 9 May 2020 11:32:10 +0800 Zefan Li wrote:
> > If systemd is configured to use hybrid mode which enables the use of
> > both cgroup v1 and v2, systemd will create new cgroup on both the default
> > root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
> > task to the two cgroups. If the task does some network thing then the v2
> > cgroup can never be freed after the session exited.
> > 
> > One of our machines ran into OOM due to this memory leak.
> > 
> > In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
> > thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
> > and increments the cgroup refcnt, but then sock_update_netprioidx() thought
> > it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
> > cgroup refcnt will never be freed.
> > 
> > Currently we do the mode switch when someone writes to the ifpriomap cgroup
> > control file. The easiest fix is to also do the switch when a task is attached
> > to a new cgroup.
> > 
> > Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")  
> 
>                      ^ space missing here
> 
> > Reported-by: Yang Yingliang <yangyingliang@huawei.com>
> > Tested-by: Yang Yingliang <yangyingliang@huawei.com>
> > Signed-off-by: Zefan Li <lizefan@huawei.com>

Fixed up the commit message and applied, thank you.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  2020-05-10  4:02     ` Jakub Kicinski
@ 2020-05-21 21:14       ` John Fastabend
  2020-05-22  3:26         ` Zefan Li
  0 siblings, 1 reply; 7+ messages in thread
From: John Fastabend @ 2020-05-21 21:14 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, ast
  Cc: Jakub Kicinski, David Miller, yangyingliang, Kefeng Wang,
	huawei.libin, guofan5, linux-kernel, cgroups,
	Linux Kernel Network Developers

Jakub Kicinski wrote:
> On Fri, 8 May 2020 22:58:29 -0700 Jakub Kicinski wrote:
> > On Sat, 9 May 2020 11:32:10 +0800 Zefan Li wrote:
> > > If systemd is configured to use hybrid mode which enables the use of
> > > both cgroup v1 and v2, systemd will create new cgroup on both the default
> > > root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
> > > task to the two cgroups. If the task does some network thing then the v2
> > > cgroup can never be freed after the session exited.
> > > 
> > > One of our machines ran into OOM due to this memory leak.
> > > 
> > > In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
> > > thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
> > > and increments the cgroup refcnt, but then sock_update_netprioidx() thought
> > > it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
> > > cgroup refcnt will never be freed.
> > > 
> > > Currently we do the mode switch when someone writes to the ifpriomap cgroup
> > > control file. The easiest fix is to also do the switch when a task is attached
> > > to a new cgroup.
> > > 
> > > Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")  
> > 
> >                      ^ space missing here
> > 
> > > Reported-by: Yang Yingliang <yangyingliang@huawei.com>
> > > Tested-by: Yang Yingliang <yangyingliang@huawei.com>
> > > Signed-off-by: Zefan Li <lizefan@huawei.com>
> 
> Fixed up the commit message and applied, thank you.

Hi Zefan, Tejun,

This is causing a regression where previously cgroupv2 bpf sockops programs
could be attached and would run even if netprio_cgroup was enabled as long
as  the netprio cgroup had not been configured. After this the bpf sockops
programs can still be attached but only programs attached to the root cgroup
will be run. For example I hit this when I ran bpf selftests on a box that
also happened to have netprio cgroup enabled, tests started failing after
bumping kernel to rc5.

I'm a bit on the fence here if it needs to be reverted. For my case its just
a test box and easy enough to work around. Also all the production cases I
have already have to be aware of this to avoid the configured error. So it
may be fine but worth noting at least. Added Alexei to see if he has any
opinion and/or uses net_prio+cgroubv2. I only looked it over briefly but
didn't see any simple rc6 worthy fixes that would fix the issue above and
also keep the original behavior.

And then while reviewing I also wonder do we have the same issue described
here in netclasid_cgroup.c with the cgrp_attach()? It would be best to keep
netcls and netprio in sync in this regard imo. At least netcls calls
cgroup_sk_alloc_disable in the write_classid() hook so I suspect it makes
sense to also add that to the attach hook?

Thanks,
John

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  2020-05-21 21:14       ` John Fastabend
@ 2020-05-22  3:26         ` Zefan Li
  0 siblings, 0 replies; 7+ messages in thread
From: Zefan Li @ 2020-05-22  3:26 UTC (permalink / raw)
  To: John Fastabend, Tejun Heo, ast
  Cc: Jakub Kicinski, David Miller, yangyingliang, Kefeng Wang,
	huawei.libin, guofan5, linux-kernel, cgroups,
	Linux Kernel Network Developers

On 2020/5/22 5:14, John Fastabend wrote:
> Jakub Kicinski wrote:
>> On Fri, 8 May 2020 22:58:29 -0700 Jakub Kicinski wrote:
>>> On Sat, 9 May 2020 11:32:10 +0800 Zefan Li wrote:
>>>> If systemd is configured to use hybrid mode which enables the use of
>>>> both cgroup v1 and v2, systemd will create new cgroup on both the default
>>>> root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
>>>> task to the two cgroups. If the task does some network thing then the v2
>>>> cgroup can never be freed after the session exited.
>>>>
>>>> One of our machines ran into OOM due to this memory leak.
>>>>
>>>> In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
>>>> thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
>>>> and increments the cgroup refcnt, but then sock_update_netprioidx() thought
>>>> it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
>>>> cgroup refcnt will never be freed.
>>>>
>>>> Currently we do the mode switch when someone writes to the ifpriomap cgroup
>>>> control file. The easiest fix is to also do the switch when a task is attached
>>>> to a new cgroup.
>>>>
>>>> Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")  
>>>
>>>                      ^ space missing here
>>>
>>>> Reported-by: Yang Yingliang <yangyingliang@huawei.com>
>>>> Tested-by: Yang Yingliang <yangyingliang@huawei.com>
>>>> Signed-off-by: Zefan Li <lizefan@huawei.com>
>>
>> Fixed up the commit message and applied, thank you.
> 
> Hi Zefan, Tejun,
> 
> This is causing a regression where previously cgroupv2 bpf sockops programs
> could be attached and would run even if netprio_cgroup was enabled as long
> as  the netprio cgroup had not been configured. After this the bpf sockops
> programs can still be attached but only programs attached to the root cgroup
> will be run. For example I hit this when I ran bpf selftests on a box that
> also happened to have netprio cgroup enabled, tests started failing after
> bumping kernel to rc5.
> 
> I'm a bit on the fence here if it needs to be reverted. For my case its just
> a test box and easy enough to work around. Also all the production cases I
> have already have to be aware of this to avoid the configured error. So it
> may be fine but worth noting at least. Added Alexei to see if he has any
> opinion and/or uses net_prio+cgroubv2. I only looked it over briefly but
> didn't see any simple rc6 worthy fixes that would fix the issue above and
> also keep the original behavior.
> 

Me neither. If we really want to keep the original behavior we probably need
to do something similar to what netclassid cgroup does, which is to iterate
all the tasks in the cgroup to update netprioidx when netprio cgroup is
configured, and we also need to not update netprioidx when a task is attached
to a new cgroup.

> And then while reviewing I also wonder do we have the same issue described
> here in netclasid_cgroup.c with the cgrp_attach()? It would be best to keep
> netcls and netprio in sync in this regard imo. At least netcls calls
> cgroup_sk_alloc_disable in the write_classid() hook so I suspect it makes
> sense to also add that to the attach hook?
> 

Fortunately we don't have this issue in netclassid cgroup. :)

Because task_cls_classid() remains 0 as long as netclassid cgroup is not
configured.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-05-22  3:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-09  3:19 [PATCH] netprio_cgroup: Fix unlimited memory leak of v2 cgroup Zefan Li
2020-05-09  3:32 ` [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups Zefan Li
2020-05-09  5:58   ` Jakub Kicinski
2020-05-10  4:02     ` Jakub Kicinski
2020-05-21 21:14       ` John Fastabend
2020-05-22  3:26         ` Zefan Li
2020-05-09 22:57   ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).