linux-kernel.vger.kernel.org archive mirror
* cgroup pointed by sock is leaked on mode switch
@ 2020-05-02 10:27 Yang Yingliang
  2020-05-05 16:06 ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Yang Yingliang @ 2020-05-02 10:27 UTC (permalink / raw)
  To: tj
  Cc: linux-kernel, cgroups, netdev, Libin (Huawei),
	yangyingliang, guofan5, wangkefeng.wang

Hi,

I got an OOM panic because a cgroup is leaked.

Here are the steps:
   - run a docker container with the --cap-add sys_admin parameter; the
     systemd process inside the container uses both cgroup v1 and cgroup v2
   - ssh from the host into the container and then exit, repeatedly (a
     sketch of the loop is below)
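
A minimal sketch of that loop, assuming an sshd-enabled systemd image (the
image name and the container address below are placeholders, not the exact
ones used here):

   # docker run -d --name repro --cap-add sys_admin <systemd-sshd-image>
   # for i in $(seq 1 1000); do ssh root@<container-address> exit; done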

I find that nr_dying_descendants is increasing:
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants 78
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants 78


The situation is the same as the one commit bd1060a1d671 ("sock, cgroup:
add sock->sk_cgroup") describes:
"On mode switch, cgroup references which are already being pointed to by
socks may be leaked."

Do we have a fix for this leak now?

Or how about fixing it by recording the cgrp2 pointer and then putting it
when the sk is freed, like this:

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index d9bd671105e2..cbb1e76ea305 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -770,6 +770,7 @@ struct sock_cgroup_data {
 #endif
         u64        val;
     };
+    struct cgroup    *cgrpv2;
 };

 /*
@@ -802,6 +803,7 @@ static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
         return;

     if (!(skcd_buf.is_data & 1)) {
+        WRITE_ONCE(skcd->cgrpv2, (struct cgroup *)(unsigned long)skcd_buf.val);
         skcd_buf.val = 0;
         skcd_buf.is_data = 1;
     }
@@ -819,6 +821,7 @@ static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
         return;

     if (!(skcd_buf.is_data & 1)) {
+        WRITE_ONCE(skcd->cgrpv2, (struct cgroup *)(unsigned long)skcd_buf.val);
         skcd_buf.val = 0;
         skcd_buf.is_data = 1;
     }
diff --git a/net/core/sock.c b/net/core/sock.c
index a0dda2bf9d7c..7c761ef2d32e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1520,6 +1520,10 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
     slab = prot->slab;

     cgroup_sk_free(&sk->sk_cgrp_data);
+    if (sk->sk_cgrp_data.cgrpv2) {
+        cgroup_put(sk->sk_cgrp_data.cgrpv2);
+        sk->sk_cgrp_data.cgrpv2 = NULL;
+    }
     mem_cgroup_sk_free(sk);
     security_sk_free(sk);
     if (slab != NULL)

Thanks,
Yang



* Re: cgroup pointed by sock is leaked on mode switch
  2020-05-02 10:27 cgroup pointed by sock is leaked on mode switch Yang Yingliang
@ 2020-05-05 16:06 ` Tejun Heo
  2020-05-06  1:50   ` Yang Yingliang
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2020-05-05 16:06 UTC (permalink / raw)
  To: Yang Yingliang
  Cc: linux-kernel, cgroups, netdev, Libin (Huawei), guofan5, wangkefeng.wang

Hello, Yang.

On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:
> I find the number nr_dying_descendants is increasing:
> linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep
> '^nr_dying_descendants [^0]'  {} +
> /sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
> /sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
> /sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants
> 1
> /sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants
> 78
> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants
> 78

Those numbers are nowhere close to causing oom issues. There are some
aspects of page and other cache draining which are being improved, but
unless you're seeing numbers multiple orders of magnitude higher, this
isn't the source of your problem.

> The situation is the same as the one commit bd1060a1d671 ("sock, cgroup: add
> sock->sk_cgroup") describes.
> "On mode switch, cgroup references which are already being pointed to by
> socks may be leaked."

I'm doubtful that you're hitting that issue. Mode switching means memcg
being switched between cgroup1 and cgroup2 hierarchies, which is unlikely
to be what's happening when you're launching docker containers.

The first step would be identifying where memory is going and finding out
whether memcg is actually being switched between cgroup1 and 2 - look at
the hierarchy number in /proc/cgroups; if that's switching between 0 and
something not zero, it is switching.
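
For example, a minimal way to watch that number over time (a sketch;
"memory" is memcg's line in /proc/cgroups, and the second field of that
line is its hierarchy number):

  while sleep 1; do awk '$1 == "memory" { print $2 }' /proc/cgroups; done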

Thanks.

-- 
tejun


* Re: cgroup pointed by sock is leaked on mode switch
  2020-05-05 16:06 ` Tejun Heo
@ 2020-05-06  1:50   ` Yang Yingliang
  2020-05-06  2:16     ` Zefan Li
  0 siblings, 1 reply; 6+ messages in thread
From: Yang Yingliang @ 2020-05-06  1:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, netdev, Libin (Huawei),
	guofan5, wangkefeng.wang, lizefan

+cc lizefan@huawei.com

On 2020/5/6 0:06, Tejun Heo wrote:
> Hello, Yang.
>
> On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:
>> I find the number nr_dying_descendants is increasing:
>> linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep
>> '^nr_dying_descendants [^0]'  {} +
>> /sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
>> /sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
>> /sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants
>> 1
>> /sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants
>> 78
>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants
>> 78
> Those numbers are nowhere close to causing oom issues. There are some
> aspects of page and other cache draining which are being improved but unless
> you're seeing numbers multiple orders of magnitude higher, this isn't the
> source of your problem.
>
>> The situation is the same as the one commit bd1060a1d671 ("sock, cgroup: add
>> sock->sk_cgroup") describes.
>> "On mode switch, cgroup references which are already being pointed to by
>> socks may be leaked."
> I'm doubtful that you're hitting that issue. Mode switching means memcg
> being switched between cgroup1 and cgroup2 hierarchies, which is unlikely to
> be what's happening when you're launching docker containers.
>
> The first step would be identifying where memory is going and finding out
> whether memcg is actually being switched between cgroup1 and 2 - look at the
> hierarchy number in /proc/cgroups, if that's switching between 0 and
> something not zero, it is switching.
>
> Thanks.
>



* Re: cgroup pointed by sock is leaked on mode switch
  2020-05-06  1:50   ` Yang Yingliang
@ 2020-05-06  2:16     ` Zefan Li
  2020-05-06  7:51       ` Zefan Li
  0 siblings, 1 reply; 6+ messages in thread
From: Zefan Li @ 2020-05-06  2:16 UTC (permalink / raw)
  To: Yang Yingliang, Tejun Heo
  Cc: linux-kernel, cgroups, netdev, Libin (Huawei), guofan5, wangkefeng.wang

On 2020/5/6 9:50, Yang Yingliang wrote:
> +cc lizefan@huawei.com
> 
> On 2020/5/6 0:06, Tejun Heo wrote:
>> Hello, Yang.
>>
>> On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:
>>> I find the number nr_dying_descendants is increasing:
>>> linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep
>>> '^nr_dying_descendants [^0]'  {} +
>>> /sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
>>> /sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
>>> /sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants
>>> 1
>>> /sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
>>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants
>>> 78
>>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants
>>> 78
>> Those numbers are nowhere close to causing oom issues. There are some
>> aspects of page and other cache draining which are being improved but unless
>> you're seeing numbers multiple orders of magnitude higher, this isn't the
>> source of your problem.
>>
>>> The situation is the same as the one commit bd1060a1d671 ("sock, cgroup: add
>>> sock->sk_cgroup") describes.
>>> "On mode switch, cgroup references which are already being pointed to by
>>> socks may be leaked."
>> I'm doubtful that you're hitting that issue. Mode switching means memcg
>> being switched between cgroup1 and cgroup2 hierarchies, which is unlikely to
>> be what's happening when you're launching docker containers.
>>
>> The first step would be identifying where memory is going and finding out
>> whether memcg is actually being switched between cgroup1 and 2 - look at the
>> hierarchy number in /proc/cgroups, if that's switching between 0 and
>> something not zero, it is switching.
>>

I think there's a bug here which can lead to an unlimited memory leak.
This should reproduce the bug:

    # mount -t cgroup -o netprio xxx /cgroup/netprio
    # mkdir /cgroup/netprio/xxx
    # echo PID > /cgroup/netprio/xxx/tasks
    /* this PID process starts to do some network thing and then exits */
    # rmdir /cgroup/netprio/xxx
    /* now this cgroup will never be freed */

Look at the code:

static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
{
	...
	sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
}

static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
					u16 prioidx)
{
	...
	if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
		return ;
	...
	skcd_buf.prioidx = prioidx;
	WRITE_ONCE(skcd->val, skcd_buf.val);
}

task_netprioidx() will be the cgroup id of xxx, which is not 1, but
sock_cgroup_prioidx(&skcd_buf) returns 1 because the socket data is still
in v2 (pointer) mode. So the early-return check does not fire, the pointer
word is overwritten without a cgroup_put(), and we have a memory leak.
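
(For reference, a paraphrase of sock_cgroup_prioidx() from
include/linux/cgroup-defs.h of this era - the exact body varies a bit
across kernel versions:)

static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
{
	/* in pointer (v2) mode is_data is 0, so the default prioidx of 1 is
	 * returned no matter which cgroup the stored pointer references
	 */
	return (skcd->is_data & 1) ? skcd->prioidx : 1;
}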

I think the easiest fix is to do the mode switch here:

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b905747..2397866 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
         struct task_struct *p;
         struct cgroup_subsys_state *css;

+       cgroup_sk_alloc_disable();
+
         cgroup_taskset_for_each(p, css, tset) {
                 void *v = (void *)(unsigned long)css->cgroup->id;


* Re: cgroup pointed by sock is leaked on mode switch
  2020-05-06  2:16     ` Zefan Li
@ 2020-05-06  7:51       ` Zefan Li
  2020-05-09  2:31         ` Yang Yingliang
  0 siblings, 1 reply; 6+ messages in thread
From: Zefan Li @ 2020-05-06  7:51 UTC (permalink / raw)
  To: Yang Yingliang, Tejun Heo
  Cc: linux-kernel, cgroups, netdev, Libin (Huawei), guofan5, wangkefeng.wang

On 2020/5/6 10:16, Zefan Li wrote:
> On 2020/5/6 9:50, Yang Yingliang wrote:
>> +cc lizefan@huawei.com
>>
>> On 2020/5/6 0:06, Tejun Heo wrote:
>>> Hello, Yang.
>>>
>>> On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:
>>>> I find the number nr_dying_descendants is increasing:
>>>> linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep
>>>> '^nr_dying_descendants [^0]'  {} +
>>>> /sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
>>>> /sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
>>>> /sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants
>>>> 1
>>>> /sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
>>>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants
>>>> 78
>>>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants
>>>> 78
>>> Those numbers are nowhere close to causing oom issues. There are some
>>> aspects of page and other cache draining which are being improved but unless
>>> you're seeing numbers multiple orders of magnitude higher, this isn't the
>>> source of your problem.
>>>
>>>> The situation is the same as the one commit bd1060a1d671 ("sock, cgroup: add
>>>> sock->sk_cgroup") describes.
>>>> "On mode switch, cgroup references which are already being pointed to by
>>>> socks may be leaked."
>>> I'm doubtful that you're hitting that issue. Mode switching means memcg
>>> being switched between cgroup1 and cgroup2 hierarchies, which is unlikely to
>>> be what's happening when you're launching docker containers.
>>>
>>> The first step would be identifying where memory is going and finding out
>>> whether memcg is actually being switched between cgroup1 and 2 - look at the
>>> hierarchy number in /proc/cgroups, if that's switching between 0 and
>>> something not zero, it is switching.
>>>
> 
> I think there's a bug here which can lead to unlimited memory leak.
> This should reproduce the bug:
> 
>     # mount -t cgroup -o netprio xxx /cgroup/netprio
>     # mkdir /cgroup/netprio/xxx
>     # echo PID > /cgroup/netprio/xxx/tasks
>     /* this PID process starts to do some network thing and then exits */
>     # rmdir /cgroup/netprio/xxx
>     /* now this cgroup will never be freed */
> 

Correction (still not tested):

     # mount -t cgroup2 none /cgroup/v2
     # mkdir /cgroup/v2/xxx
     # echo PID > /cgroup/v2/xxx/cgroup.procs
     /* this PID process starts to do some network thing */

     # mount -t cgroup -o netprio xxx /cgroup/netprio
     # mkdir /cgroup/netprio/xxx
     # echo PID > /cgroup/netprio/xxx/tasks
     ...
     /* the PID process exits */

     # rmdir /cgroup/netprio/xxx
     # rmdir /cgroup/v2/xxx
     /* now it looks like this v2 cgroup will never be freed */

> Look at the code:
> 
> static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
> {
>      ...
>      sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
> }
> 
> static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
>                      u16 prioidx)
> {
>      ...
>      if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
>          return ;
>      ...
>      skcd_buf.prioidx = prioidx;
>      WRITE_ONCE(skcd->val, skcd_buf.val);
> }
> 
> task_netprioidx() will be the cgrp id of xxx which is not 1, but
> sock_cgroup_prioidx(&skcd_buf) is 1 because it thought it's in v2 mode.
> Now we have a memory leak.
> 
> I think the easiest fix is to do the mode switch here:
> 
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index b905747..2397866 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
>          struct task_struct *p;
>          struct cgroup_subsys_state *css;
> 
> +       cgroup_sk_alloc_disable();
> +
>          cgroup_taskset_for_each(p, css, tset) {
>                  void *v = (void *)(unsigned long)css->cgroup->id;



* Re: cgroup pointed by sock is leaked on mode switch
  2020-05-06  7:51       ` Zefan Li
@ 2020-05-09  2:31         ` Yang Yingliang
  0 siblings, 0 replies; 6+ messages in thread
From: Yang Yingliang @ 2020-05-09  2:31 UTC (permalink / raw)
  To: Zefan Li, Tejun Heo
  Cc: linux-kernel, cgroups, netdev, Libin (Huawei), guofan5, wangkefeng.wang


On 2020/5/6 15:51, Zefan Li wrote:
> On 2020/5/6 10:16, Zefan Li wrote:
>> On 2020/5/6 9:50, Yang Yingliang wrote:
>>> +cc lizefan@huawei.com
>>>
>>> On 2020/5/6 0:06, Tejun Heo wrote:
>>>> Hello, Yang.
>>>>
>>>> On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:
>>>>> I find the number nr_dying_descendants is increasing:
>>>>> linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep
>>>>> '^nr_dying_descendants [^0]'  {} +
>>>>> /sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
>>>>> /sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 
>>>>> 1
>>>>> /sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 
>>>>>
>>>>> 1
>>>>> /sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
>>>>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants 
>>>>>
>>>>> 78
>>>>> /sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants 
>>>>>
>>>>> 78
>>>> Those numbers are nowhere close to causing oom issues. There are some
>>>> aspects of page and other cache draining which are being improved 
>>>> but unless
>>>> you're seeing numbers multiple orders of magnitude higher, this 
>>>> isn't the
>>>> source of your problem.
>>>>
>>>>> The situation is as same as the commit bd1060a1d671 ("sock, 
>>>>> cgroup: add
>>>>> sock->sk_cgroup") describes.
>>>>> "On mode switch, cgroup references which are already being pointed 
>>>>> to by
>>>>> socks may be leaked."
>>>> I'm doubtful that you're hitting that issue. Mode switching means 
>>>> memcg
>>>> being switched between cgroup1 and cgroup2 hierarchies, which is 
>>>> unlikely to
>>>> be what's happening when you're launching docker containers.
>>>>
>>>> The first step would be identifying where memory is going and 
>>>> finding out
>>>> whether memcg is actually being switched between cgroup1 and 2 - 
>>>> look at the
>>>> hierarchy number in /proc/cgroups, if that's switching between 0 and
>>>> something not zero, it is switching.
>>>>
>>
>> I think there's a bug here which can lead to unlimited memory leak.
>> This should reproduce the bug:
>>
>>     # mount -t cgroup -o netprio xxx /cgroup/netprio
>>     # mkdir /cgroup/netprio/xxx
>>     # echo PID > /cgroup/netprio/xxx/tasks
>>     /* this PID process starts to do some network thing and then 
>> exits */
>>     # rmdir /cgroup/netprio/xxx
>>     /* now this cgroup will never be freed */
>>
>
> Correction (still not tested):
>
>     # mount -t cgroup2 none /cgroup/v2
>     # mkdir /cgroup/v2/xxx
>     # echo PID > /cgroup/v2/xxx/cgroup.procs
>     /* this PID process starts to do some network thing */
>
>     # mount -t cgroup -o netprio xxx /cgroup/netprio
>     # mkdir /cgroup/netprio/xxx
>     # echo PID > /cgroup/netprio/xxx/tasks
>     ...
>     /* the PID process exits */
>
>     rmdir /cgroup/netprio/xxx
>     rmdir /cgroup/v2/xxx
>     /* now looks like this v2 cgroup will never be freed */
>
>> Look at the code:
>>
>> static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
>> {
>>      ...
>>      sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
>> }
>>
>> static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data 
>> *skcd,
>>                      u16 prioidx)
>> {
>>      ...
>>      if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
>>          return ;
>>      ...
>>      skcd_buf.prioidx = prioidx;
>>      WRITE_ONCE(skcd->val, skcd_buf.val);
>> }
>>
>> task_netprioidx() will be the cgrp id of xxx which is not 1, but
>> sock_cgroup_prioidx(&skcd_buf) is 1 because it thought it's in v2 mode.
>> Now we have a memory leak.
>>
>> I think the easiest fix is to do the mode switch here:
>>
>> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
>> index b905747..2397866 100644
>> --- a/net/core/netprio_cgroup.c
>> +++ b/net/core/netprio_cgroup.c
>> @@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset 
>> *tset)
>>          struct task_struct *p;
>>          struct cgroup_subsys_state *css;
>>
>> +       cgroup_sk_alloc_disable();
>> +
>>          cgroup_taskset_for_each(p, css, tset) {
>>                  void *v = (void *)(unsigned long)css->cgroup->id;
>
I've done some tests with this change; here are the test results:


Without this change, nr_dying_descendants is increased and the cgroup is leaked:

linux-dVpNUK:~ # mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
linux-dVpNUK:~ # mkdir /sys/fs/cgroup/net_cls,net_prio/test
linux-dVpNUK:~ # ps -ef | grep bash
root     12151 12150  0 00:31 pts/0    00:00:00 -bash
root     12322 12321  0 00:31 pts/1    00:00:00 -bash
root     12704 12703  0 00:31 pts/2    00:00:00 -bash
root     13359 12704  0 00:31 pts/2    00:00:00 grep --color=auto bash
linux-dVpNUK:~ # echo 12151 > /sys/fs/cgroup/net_cls,net_prio/test/tasks
linux-dVpNUK:~ # echo 12322 > /sys/fs/cgroup/net_cls,net_prio/test/tasks

(Use bash (12151/12322) to do some network things, then kill them;
nr_dying_descendants is increased.)
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
linux-dVpNUK:~ # kill 12151 12322
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 3
/sys/fs/cgroup/unified/user.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/user.slice/user-0.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1

(After rmdir of test, nr_dying_descendants is not decreased.)
linux-dVpNUK:~ # rmdir /sys/fs/cgroup/net_cls,net_prio/test
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 3
/sys/fs/cgroup/unified/user.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/user.slice/user-0.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1


With this change, nr_dying_descendants is not increased:

linux-dVpNUK:~ # mkdir /sys/fs/cgroup/net_cls,net_prio/test
linux-dVpNUK:~ # ps -ef | grep bash
root      5466  5443  0 00:50 pts/1    00:00:00 -bash
root      5724  5723  0 00:50 pts/2    00:00:00 -bash
root      6701 17013  0 00:50 pts/0    00:00:00 grep --color=auto bash
root     17013 17012  0 00:46 pts/0    00:00:00 -bash
linux-dVpNUK:~ # echo 5466 > /sys/fs/cgroup/net_cls,net_prio/test/tasks
linux-dVpNUK:~ # echo 5724 > /sys/fs/cgroup/net_cls,net_prio/test/tasks
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
linux-dVpNUK:~ # kill 5466 5724
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
linux-dVpNUK:~ # rmdir /sys/fs/cgroup/net_cls,net_prio/test/
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1

> .



Thread overview: 6+ messages
2020-05-02 10:27 cgroup pointed by sock is leaked on mode switch Yang Yingliang
2020-05-05 16:06 ` Tejun Heo
2020-05-06  1:50   ` Yang Yingliang
2020-05-06  2:16     ` Zefan Li
2020-05-06  7:51       ` Zefan Li
2020-05-09  2:31         ` Yang Yingliang
