* cgroup: rmdir() does not complete
  From: Mark Hills @ 2010-08-26 15:51 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel

I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
spins, others queue up behind it with the following:

INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
soaked-cgrou  D ffff8800058157c0     0 27257  29411 0x00000000
 ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
 ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
Call Trace:
 [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
 [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
 [<ffffffff81108a7c>] ? path_put+0x1d/0x22
 [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
 [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
 [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
 [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
tasks.

Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy-wait loop to
the rmdir path. It looks like what I am seeing here, and indicates that
some cgroup subsystem is busy indefinitely.

I have not worked out how to reproduce it quickly. My only way is to
complete a 'dd' command in the cgroup, but even then the problem is so
rare that progress is slow.

Documentation/cgroups/memory.txt describes how force_empty can be
required in some cases. Does this mean that, with the patch above, these
cases will now spin in rmdir() instead of returning -EBUSY? How can I
produce a reliable test case that requires memory.force_empty, to test
this?

Or is it likely to be some other cause, and how best to find it?

Thanks

--
Mark
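For readers following along: the force_empty interface mentioned above is a
plain cgroup control file, and the documented procedure amounts to the
following; the mount point and group name here are illustrative, not taken
from the report:

  # writing any value asks the kernel to drop the group's remaining charges
  echo 1 > /cgroup/test/memory.force_empty
  rmdir /cgroup/test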
* Re: cgroup: rmdir() does not complete
  From: Daisuke Nishimura @ 2010-08-27 0:56 UTC
  To: Mark Hills; +Cc: KAMEZAWA Hiroyuki, linux-kernel, Daisuke Nishimura

Hi.

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> I am experiencing hung tasks when trying to rmdir() on a cgroup. One
> task spins, others queue up behind it with the following:
>
> [hung task trace snipped]
>
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> tasks.
>
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy-wait loop to
> the rmdir path. It looks like what I am seeing here, and indicates that
> some cgroup subsystem is busy indefinitely.
>
That commit did cause an rmdir bug, but it was fixed by commit 88703267.
The fix was merged in 2.6.31, so it seems you have hit a new one...

> [...]
>
> Documentation/cgroups/memory.txt describes how force_empty can be
> required in some cases. Does this mean that, with the patch above, these
> cases will now spin in rmdir() instead of returning -EBUSY? How can I
> produce a reliable test case that requires memory.force_empty, to test
> this?
>
You don't need to touch "force_empty". rmdir() does what "force_empty"
does.

> Or is it likely to be some other cause, and how best to find it?
>
Which cgroup subsystem was mounted on the hierarchy containing the
directory you tried to rmdir() in the first place? If you mounted several
subsystems on the same hierarchy, can you mount them separately to narrow
down the cause?

Thanks,
Daisuke Nishimura.
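The suggestion to mount subsystems separately looks roughly like this; the
mount points are illustrative:

  mkdir -p /cgroups/memory /cgroups/cpu
  mount -t cgroup -o memory none /cgroups/memory    # memory controller alone
  mount -t cgroup -o cpu,cpuacct none /cgroups/cpu  # scheduler controllers alone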
* Re: cgroup: rmdir() does not complete
  From: Balbir Singh @ 2010-08-27 1:20 UTC
  To: Daisuke Nishimura; +Cc: Mark Hills, KAMEZAWA Hiroyuki, linux-kernel

On Fri, Aug 27, 2010 at 6:26 AM, Daisuke Nishimura
<nishimura@mxp.nes.nec.co.jp> wrote:
> [...]
> Which cgroup subsystem was mounted on the hierarchy containing the
> directory you tried to rmdir() in the first place? If you mounted
> several subsystems on the same hierarchy, can you mount them separately
> to narrow down the cause?
>
It would also be nice to see what your mounted cgroup hierarchy (from the
filesystem perspective) and /proc/cgroups look like when the problem
occurs.

Balbir
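Both pieces of information requested here can be gathered with two
commands; the exact output format varies by kernel version:

  grep cgroup /proc/mounts   # shows which subsystems share each hierarchy
  cat /proc/cgroups          # per-subsystem hierarchy id, cgroup count, enabled flag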
* Re: cgroup: rmdir() does not complete
  From: KAMEZAWA Hiroyuki @ 2010-08-27 2:35 UTC
  To: Daisuke Nishimura; +Cc: Mark Hills, linux-kernel

On Fri, 27 Aug 2010 09:56:39 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > Or is it likely to be some other cause, and how best to find it?
> >
> Which cgroup subsystem was mounted on the hierarchy containing the
> directory you tried to rmdir() in the first place? If you mounted
> several subsystems on the same hierarchy, can you mount them separately
> to narrow down the cause?
>
It seems I can reproduce the issue on mmotm-0811, too.

Try this. Here, the memory cgroup is mounted at /cgroups.
==
#!/bin/bash -x

while sleep 1; do
	date
	mkdir /cgroups/test
	echo 0 > /cgroups/test/tasks
	echo 300M > /cgroups/test/memory.limit_in_bytes
	cat /proc/self/cgroup
	dd if=/dev/zero of=./tmpfile bs=4096 count=100000
	echo 0 > /cgroups/tasks
	cat /proc/self/cgroup
	rmdir /cgroups/test
	rm ./tmpfile
done
==

It hangs at rmdir. I'm now investigating force_empty.

Thanks,
-Kame
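When driving a loop like the one above, a bounded variant makes a stuck
iteration easier to catch and inspect; this is a sketch, assuming
coreutils' timeout(1) is available and the same /cgroups layout:

==
if ! timeout 120 rmdir /cgroups/test; then
	echo "rmdir stuck or failed at $(date)" >&2
	cat /cgroups/test/tasks                  # expected to be empty
	cat /cgroups/test/memory.usage_in_bytes  # any residual charge
	break
fi
==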
* Re: cgroup: rmdir() does not complete
  From: Daisuke Nishimura @ 2010-08-27 3:39 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Mark Hills, linux-kernel, balbir, Daisuke Nishimura

On Fri, 27 Aug 2010 11:35:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> It seems I can reproduce the issue on mmotm-0811, too.
>
> [test script snipped]
>
> It hangs at rmdir. I'm now investigating force_empty.
>
Thank you very much for your information.

Some questions.

Is "tmpfile" created on a normal filesystem (e.g. ext3) or on tmpfs?
And how long does it typically take to cause this problem?
I've run it on a RHEL6-based kernel/ext3 for about one hour, but
I cannot reproduce it yet.

Thanks,
Daisuke Nishimura.
* Re: cgroup: rmdir() does not complete
  From: KAMEZAWA Hiroyuki @ 2010-08-27 5:42 UTC
  To: Daisuke Nishimura; +Cc: Mark Hills, linux-kernel, balbir

On Fri, 27 Aug 2010 12:39:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> [...]
> Is "tmpfile" created on a normal filesystem (e.g. ext3) or on tmpfs?

On ext4.

> And how long does it typically take to cause this problem?

Very soon, within 10-20 loops.

> I've run it on a RHEL6-based kernel/ext3 for about one hour, but
> I cannot reproduce it yet.
>
Hmm... I'll dig more. Maybe I need to use a stock kernel rather than -mm...

Thanks,
-Kame
* Re: cgroup: rmdir() does not complete
  From: KAMEZAWA Hiroyuki @ 2010-08-27 6:29 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, Mark Hills, linux-kernel, balbir

On Fri, 27 Aug 2010 14:42:25 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I've run it on a RHEL6-based kernel/ext3 for about one hour, but
> > I cannot reproduce it yet.
> >
> Hmm... I'll dig more. Maybe I need to use a stock kernel rather than -mm...
>
Sorry, my test hangs only on -mm + (other patches); there is no trouble on
2.6.34 or 2.6.36-rc1.

Where can I get the 2.6.33.6 (Fedora) kernel?

Thanks,
-Kame
* Re: cgroup: rmdir() does not complete
  From: Balbir Singh @ 2010-08-30 7:32 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, Mark Hills, linux-kernel

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-08-27 15:29:58]:
> [...]
> Where can I get the 2.6.33.6 (Fedora) kernel?
>
You can get the SRPM from the mirrors; one place to find it would be
http://download.fedora.redhat.com/pub/fedora/linux/updates/13/SRPMS/

--
	Three Cheers,
	Balbir
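Fetching and installing the source package would look like the following;
the package NVR is a guess based on the 2.6.33.6-147 version mentioned
later in this thread, so substitute the one matching the running kernel:

  wget http://download.fedora.redhat.com/pub/fedora/linux/updates/13/SRPMS/kernel-2.6.33.6-147.fc13.src.rpm
  rpm -ivh kernel-2.6.33.6-147.fc13.src.rpm   # unpacks under ~/rpmbuild/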
* Re: cgroup: rmdir() does not complete
  From: Mark Hills @ 2010-08-30 9:13 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:
> [...]
> > Is "tmpfile" created on a normal filesystem (e.g. ext3) or on tmpfs?
>
> On ext4.
>
> > And how long does it typically take to cause this problem?
>
> Very soon, within 10-20 loops.

The test case I was running is similar to the above. With the Lustre
filesystem the problem takes 4 hours or more to show itself. Recently I
ran 4 threads for over 24 hours without it being seen -- I suspect some
external factor is involved.

I also tried NFS, and did not see a problem after 8 hours or so, but this
is inconclusive.

The combination of the Fedora kernel and the Lustre filesystem is not a
satisfactory setup for tracing the bug. Until I can get a test case which
is more readily reproducible, I'm not able to reasonably think about
changing variables.

It is interesting that you see the problem so readily on ext4; I will test
that soon (it is currently a holiday weekend in the UK). I hope it will
give me the test case I am looking for.

Thanks

--
Mark
* Re: cgroup: rmdir() does not complete
  From: Mark Hills @ 2010-09-01 11:10 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:
> [...]
> > And how long does it typically take to cause this problem?
>
> Very soon, within 10-20 loops.

I repeated the test above, but did not see a problem after many hundreds
of loops.

My test was with the same kernel from my original bug report (Fedora
2.6.33.6-147), using the memory cgroup only and an ext4 filesystem.

So it is possible we are experiencing different bugs with similar
symptoms.

--
Mark
* Re: cgroup: rmdir() does not complete
  From: KAMEZAWA Hiroyuki @ 2010-09-01 23:42 UTC
  To: Mark Hills; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Wed, 1 Sep 2010 12:10:23 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> [...]
> I repeated the test above, but did not see a problem after many hundreds
> of loops.
>
> My test was with the same kernel from my original bug report (Fedora
> 2.6.33.6-147), using the memory cgroup only and an ext4 filesystem.
>
> So it is possible we are experiencing different bugs with similar
> symptoms.
>
Thank you for confirming.
But hmm... it's curious who holds the mutex, and what is happening.

-Kame
* Re: cgroup: rmdir() does not complete
  From: Mark Hills @ 2010-09-02 9:45 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:
> On Wed, 1 Sep 2010 12:10:23 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:

[...]

> > I repeated the test above, but did not see a problem after many
> > hundreds of loops.
> >
> > My test was with the same kernel from my original bug report (Fedora
> > 2.6.33.6-147), using the memory cgroup only and an ext4 filesystem.
> >
> > So it is possible we are experiencing different bugs with similar
> > symptoms.
> >
> Thank you for confirming.
> But hmm... it's curious who holds the mutex, and what is happening.

Refer to my original email, where I was running multiple tests at once.
This backtrace is from the tests which queue up:

Call Trace:
 [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
 [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
 [<ffffffff81108a7c>] ? path_put+0x1d/0x22
 [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
 [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
 [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
 [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

The one which spins has already managed to claim the mutex lock on the
/cgroup directory, and no call trace is shown for it.

Is there a usable way to force a similar call trace for the spinning
process?

Unfortunately I have not been able to reproduce the problem for some days
now, so I suspect some network factor is able to influence this.

--
Mark
* Re: cgroup: rmdir() does not complete
  From: Mark Hills @ 2010-09-09 10:01 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:

[...]
> But hmm... it's curious who holds the mutex, and what is happening.

I have a system showing the failure case (but still do not have a way to
reliably repeat it).

Here are the two processes:

23586 pts/0    RL+  5059:18 /net/homes/mhills/tmp/soaked-cgroup
23685 pts/6    DL+     0:00 /net/homes/mhills/tmp/soaked-cgroup

23586 spends almost all of its time in 'RL+' status; occasionally it is
seen in 'DL+' status.

From my analysis before, both are blocked in rmdir(), but one is spinning,
holding the lock on /cgroup, and the other is waiting for the lock. If I
strace 23586 then the rmdir() fails with EINTR.

How best to capture information which might show why the process spins?

--
Mark
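A few ways to get a kernel-side view of a task in this state: the
/proc/<pid>/stack file (which needs CONFIG_STACKTRACE) and wchan are most
useful while the task sleeps, while sysrq-t dumps every task to the kernel
log:

  cat /proc/23586/stack          # kernel stack of the task, if available
  cat /proc/23586/wchan; echo    # function the task is blocked in, if sleeping
  echo t > /proc/sysrq-trigger   # dump all tasks' state to dmesg
  dmesg | tail -n 200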
* Re: cgroup: rmdir() does not complete
  From: Balbir Singh @ 2010-09-09 10:09 UTC
  To: Mark Hills; +Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

* Mark Hills <mark@pogo.org.uk> [2010-09-09 11:01:45]:
> [...]
> From my analysis before, both are blocked in rmdir(), but one is
> spinning, holding the lock on /cgroup, and the other is waiting for the
> lock. If I strace 23586 then the rmdir() fails with EINTR.
>
Any chance you can compile with the debug cgroup subsystem enabled and get
information from there?

--
	Three Cheers,
	Balbir
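Concretely, this means rebuilding with CONFIG_CGROUP_DEBUG=y and mounting
the debug controller; the set of debug.* files it exposes differs between
kernel versions:

  mkdir -p /mnt/cgroup-debug
  mount -t cgroup -o debug none /mnt/cgroup-debug
  ls /mnt/cgroup-debug           # refcount/css-set state for each group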
* Re: cgroup: rmdir() does not complete
  From: Mark Hills @ 2010-09-09 11:36 UTC
  To: Balbir Singh; +Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

On Thu, 9 Sep 2010, Balbir Singh wrote:
> [...]
> Any chance you can compile with the debug cgroup subsystem enabled and
> get information from there?

I can; I'd like to experiment with a custom kernel next.

I am still finding the problem incredibly hard to reproduce, so I'd like
to observe as much data as possible from the current case before
rebooting. If I could capture some kind of stack trace in the kernel for
the running process that would be great; any suggestions appreciated.

Thanks

--
Mark
* Re: cgroup: rmdir() does not complete
  From: Peter Zijlstra @ 2010-09-09 11:50 UTC
  To: Mark Hills; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

On Thu, 2010-09-09 at 12:36 +0100, Mark Hills wrote:
> I am still finding the problem incredibly hard to reproduce, so I'd like
> to observe as much data as possible from the current case before
> rebooting. If I could capture some kind of stack trace in the kernel for
> the running process that would be great; any suggestions appreciated.

  echo l > /proc/sysrq-trigger

another thing you can do is run something like:

  perf record -gp $pid

which will give you a profile of that task.
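Spelled out against the pid from earlier in the thread, that is:

  perf record -g -p 23586 sleep 30   # sample pid 23586 while 'sleep 30' runs
  perf report                        # browse the recorded call graph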
* Re: cgroup: rmdir() does not complete
  From: Mark Hills @ 2010-09-09 23:04 UTC
  To: Peter Zijlstra; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

On Thu, 9 Sep 2010, Peter Zijlstra wrote:
> [...]
>   echo l > /proc/sysrq-trigger

Despite running this many times, I never 'catch' the process on a CPU,
despite it using 70% in top. But...

> another thing you can do is run something like:
>
>   perf record -gp $pid
>
> which will give you a profile of that task.

This is very useful, thanks.

The report on the spinning process (23586) is dominated by calls from
mem_cgroup_force_empty.

It seems to show that lru_add_drain_all and drain_all_stock_sync are
causing the load (I assume drain_all_stock_sync has been optimised out).
But I don't think this is as important as what causes the spin.

There are no tasks in the cgroup, but memory usage is non-zero and
constant. It seems mem_cgroup_force_empty is unable to empty the cgroup in
this case.

# cat /cgroup/soaked-23586/tasks
# cat /cgroup/soaked-23586/memory.usage_in_bytes
24576
# cat /cgroup/soaked-23586/memsw.usage_in_bytes
<hangs>

Here are the first few entries from the perf output; I can provide the
rest if needed, but all result from mem_cgroup_force_empty.

 8.13%  :23586  [kernel]  [k] _raw_spin_lock_irqsave
        |
        --- _raw_spin_lock_irqsave
           |
           |--45.14%-- probe_workqueue_insertion
           |          insert_work
           |          |
           |          |--99.09%-- __queue_work
           |          |          queue_work_on
           |          |          schedule_work_on
           |          |          schedule_on_each_cpu
           |          |          |
           |          |          |--50.59%-- lru_add_drain_all
           |          |          |          mem_cgroup_force_empty
           |          |          |          mem_cgroup_pre_destroy
           |          |          |          cgroup_rmdir
           |          |          |          vfs_rmdir
           |          |          |          do_rmdir
           |          |          |          sys_rmdir
           |          |          |          system_call_fastpath
           |          |          |          0x3f504d27d7
           |          |          |          0x405687
           |          |          |          0x406ef0
           |          |          |          0x402f31
           |          |          |          0x3f5041eb1d
           |          |          |
           |          |           --49.41%-- mem_cgroup_force_empty
           |          |                      mem_cgroup_pre_destroy
           |          |                      cgroup_rmdir
           |          |                      vfs_rmdir
           |          |                      do_rmdir
           |          |                      sys_rmdir
           |          |                      system_call_fastpath
           |          |                      0x3f504d27d7
           |          |                      0x405687
           |          |                      0x406ef0
           |          |                      0x402f31
           |          |                      0x3f5041eb1d
           |           --0.91%-- [...]
           |
           |--22.92%-- mem_cgroup_force_empty
           |          mem_cgroup_pre_destroy
           |          cgroup_rmdir
           |          vfs_rmdir
           |          do_rmdir
           |          sys_rmdir
           |          system_call_fastpath
           |          0x3f504d27d7
           |          0x405687
           |          0x406ef0
           |          0x402f31
           |          0x3f5041eb1d
           |
           |--8.17%-- __queue_work
           |          queue_work_on
           |          schedule_work_on
           |          schedule_on_each_cpu
           |          |
           |          |--52.09%-- lru_add_drain_all
           |          |          mem_cgroup_force_empty
           |          |          mem_cgroup_pre_destroy
           |          |          cgroup_rmdir
           |          |          vfs_rmdir
           |          |          do_rmdir
           |          |          sys_rmdir
           |          |          system_call_fastpath
           |          |          0x3f504d27d7
           |          |          0x405687
           |          |          0x406ef0
           |          |          0x402f31
           |          |          0x3f5041eb1d
           |          |
           |           --47.91%-- mem_cgroup_force_empty
           |                      mem_cgroup_pre_destroy
           |                      cgroup_rmdir
           |                      vfs_rmdir
           |                      do_rmdir
           |                      sys_rmdir
           |                      system_call_fastpath
           |                      0x3f504d27d7
           |                      0x405687
           |                      0x406ef0
           |                      0x402f31
           |                      0x3f5041eb1d
           |
           |--7.94%-- __wake_up
           |          |
           |          |--99.71%-- insert_work
           |          |          |
           |          |          |--97.70%-- __queue_work
           |          |          |          queue_work_on
           |          |          |          schedule_work_on
           |          |          |          schedule_on_each_cpu
           |          |          |          |
           |          |          |          |--50.59%-- mem_cgroup_force_empty
           |          |          |          |          mem_cgroup_pre_destroy
           |          |          |          |          cgroup_rmdir
           |          |          |          |          vfs_rmdir
           |          |          |          |          do_rmdir
           |          |          |          |          sys_rmdir
           |          |          |          |          system_call_fastpath
           |          |          |          |          0x3f504d27d7
           |          |          |          |          0x405687
           |          |          |          |          0x406ef0
           |          |          |          |          0x402f31
           |          |          |          |          0x3f5041eb1d
           |          |          |          |
           |          |          |           --49.41%-- lru_add_drain_all
           |          |          |                      mem_cgroup_force_empty
           |          |          |                      mem_cgroup_pre_destroy
           |          |          |                      cgroup_rmdir
           |          |          |                      vfs_rmdir
           |          |          |                      do_rmdir
           |          |          |                      sys_rmdir
           |          |          |                      system_call_fastpath
           |          |          |                      0x3f504d27d7
           |          |          |                      0x405687
           |          |          |                      0x406ef0
           |          |          |                      0x402f31
           |          |          |                      0x3f5041eb1d
           |          |           --2.30%-- [...]
           |           --0.29%-- [...]
           |
           |--4.35%-- mem_cgroup_pre_destroy
           |          cgroup_rmdir
           |          vfs_rmdir
           |          do_rmdir
           |          sys_rmdir
           |          system_call_fastpath
           |          0x3f504d27d7
           |          0x405687
           |          0x406ef0
           |          0x402f31
           |          0x3f5041eb1d
            --11.47%-- [...]

 7.25%  :23586  [kernel]  [k] sched_clock_cpu
        |
        --- sched_clock_cpu
           |
           |--97.11%-- update_rq_clock
           |          |
           |          |--98.89%-- try_to_wake_up
           |          |          default_wake_function
           |          |          autoremove_wake_function
           |          |          __wake_up_common
           |          |          __wake_up
           |          |          insert_work
           |          |          __queue_work
           |          |          queue_work_on
           |          |          schedule_work_on
           |          |          schedule_on_each_cpu
           |          |          |
           |          |          |--50.69%-- lru_add_drain_all
           |          |          |          mem_cgroup_force_empty
           |          |          |          mem_cgroup_pre_destroy
           |          |          |          cgroup_rmdir
           |          |          |          vfs_rmdir
           |          |          |          do_rmdir
           |          |          |          sys_rmdir
           |          |          |          system_call_fastpath
           |          |          |          0x3f504d27d7
           |          |          |          0x405687
           |          |          |          0x406ef0
           |          |          |          0x402f31
           |          |          |          0x3f5041eb1d
           |          |          |
           |          |           --49.31%-- mem_cgroup_force_empty
           |          |                      mem_cgroup_pre_destroy
           |          |                      cgroup_rmdir
           |          |                      vfs_rmdir
           |          |                      do_rmdir
           |          |                      sys_rmdir
           |          |                      system_call_fastpath
           |          |                      0x3f504d27d7
           |          |                      0x405687
           |          |                      0x406ef0
           |          |                      0x402f31
           |          |                      0x3f5041eb1d
           |           --1.11%-- [...]
            --2.89%-- [...]

 5.54%  :23586  [kernel]  [k] try_to_wake_up
        |
        --- try_to_wake_up
           |
           |--99.13%-- default_wake_function
           |          autoremove_wake_function
           |          __wake_up_common
           |          __wake_up
           |          insert_work
           |          __queue_work
           |          queue_work_on
           |          schedule_work_on
           |          schedule_on_each_cpu
           |          |
           |          |--52.03%-- lru_add_drain_all
           |          |          mem_cgroup_force_empty
           |          |          mem_cgroup_pre_destroy
           |          |          cgroup_rmdir
           |          |          vfs_rmdir
           |          |          do_rmdir
           |          |          sys_rmdir
           |          |          system_call_fastpath
           |          |          0x3f504d27d7
           |          |          0x405687
           |          |          0x406ef0
           |          |          0x402f31
           |          |          0x3f5041eb1d
           |          |
           |           --47.97%-- mem_cgroup_force_empty
           |                      mem_cgroup_pre_destroy
           |                      cgroup_rmdir
           |                      vfs_rmdir
           |                      do_rmdir
           |                      sys_rmdir
           |                      system_call_fastpath
           |                      0x3f504d27d7
           |                      0x405687
           |                      0x406ef0
           |                      0x402f31
           |                      0x3f5041eb1d
            --0.87%-- [...]

--
Mark
* Re: cgroup: rmdir() does not complete
  From: KAMEZAWA Hiroyuki @ 2010-09-09 23:43 UTC
  To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> [...]
> There are no tasks in the cgroup, but memory usage is non-zero and
> constant. It seems mem_cgroup_force_empty is unable to empty the cgroup
> in this case.
>
> # cat /cgroup/soaked-23586/tasks
> # cat /cgroup/soaked-23586/memory.usage_in_bytes
> 24576
> # cat /cgroup/soaked-23586/memsw.usage_in_bytes
> <hangs>
>
I think this "cat" hang is because of the VFS's lock.

Hmm, then, there are pages on the LRU which cannot be moved, or there is a
leak of accounting.

BTW, mem_cgroup's rmdir is designed to be interruptible by SIGINT etc.
Can't you stop the rmdir with Ctrl-C or similar?

  rmdir -> hang -> Ctrl-C (or similar) -> cat .../memory.stat

Can that work? And do you still use Fedora's kernel?

Thanks,
-Kame

> [perf output snipped]
* Re: cgroup: rmdir() does not complete
  From: KAMEZAWA Hiroyuki @ 2010-09-10 2:16 UTC
  To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> The report on the spinning process (23586) is dominated by calls from
> mem_cgroup_force_empty.
>
> It seems to show that lru_add_drain_all and drain_all_stock_sync are
> causing the load (I assume drain_all_stock_sync has been optimised out).
> But I don't think this is as important as what causes the spin.
>
I noticed you use FUSE, and it seems there is a problem between FUSE and
memcg. I wrote a patch (against 2.6.36, but it can be applied).

Could you try this? I'm sorry, I don't use a FUSE system and can't test it
right now.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memory cgroup catches all pages which are added to the radix-tree, and
assumes the pages will be added to the LRU somewhere. But there are pages
which are on the radix-tree and never on the LRU. force_empty cannot find
them, and so cannot finish ->pre_destroy() and the rmdir operation.

This patch adds __GFP_NOMEMCGROUP so that such unnecessary, out-of-control
pages are not registered to memory cgroup.

Note: This gfp flag could also be used for shmem handling, which now uses
complicated heuristics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/fuse/dev.c       |   11 ++++++++++-
 include/linux/gfp.h |    7 +++++++
 mm/memcontrol.c     |    2 +-
 3 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/swap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
 	struct pipe_buffer *buf = cs->pipebufs;
 	struct address_space *mapping;
 	pgoff_t index;
+	gfp_t mask = GFP_KERNEL;
 
 	unlock_request(cs->fc, cs->req);
 	fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
 	remove_from_page_cache(oldpage);
 	page_cache_release(oldpage);
 
-	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+	/*
+	 * not-on-LRU pages are out of control. So, add to root cgroup.
+	 * See mm/memcontrol.c for details.
+	 */
+	if (buf->flags & PIPE_BUF_FLAG_LRU)
+		mask |= __GFP_NOMEMCGROUP;
+
+	err = add_to_page_cache_locked(newpage, mapping, index, mask);
 	if (err) {
 		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
 		goto out_fallback_unlock;

Index: linux-2.6.36-rc3/include/linux/gfp.h
===================================================================
--- linux-2.6.36-rc3.orig/include/linux/gfp.h
+++ linux-2.6.36-rc3/include/linux/gfp.h
@@ -60,6 +60,13 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)0)
 #endif
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#define __GFP_NOMEMCGROUP ((__force gfp_t)0x400000u)
+	/* Don't track by memory cgroup */
+#else
+#define __GFP_NOMEMCGROUP ((__force gfp_t)0)
+#endif
+
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).

Index: linux-2.6.36-rc3/mm/memcontrol.c
===================================================================
--- linux-2.6.36-rc3.orig/mm/memcontrol.c
+++ linux-2.6.36-rc3/mm/memcontrol.c
@@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
+	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
 		return 0;
 	/*
 	 * Corner case handling. This is called from add_to_page_cache()
==
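For anyone wanting to try the patch, applying it to a matching tree would
go roughly as follows; the file name is illustrative, and --dry-run guards
against fuzz when applying to older trees such as the 2.6.33.6 kernel
discussed here:

  cd linux-2.6.36-rc3
  patch -p1 --dry-run < fuse-nomemcgroup.patch && patch -p1 < fuse-nomemcgroup.patch
  make oldconfig && make -j4 bzImage modules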
* Re: cgroup: rmdir() does not complete
  From: Daisuke Nishimura @ 2010-09-10 4:05 UTC
  To: KAMEZAWA Hiroyuki; +Cc: Mark Hills, Peter Zijlstra, Balbir Singh, linux-kernel, Daisuke Nishimura

On Fri, 10 Sep 2010 11:16:46 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I noticed you use FUSE, and it seems there is a problem between FUSE and
> memcg. I wrote a patch (against 2.6.36, but it can be applied).
>
Nice catch!

> Could you try this? I'm sorry, I don't use a FUSE system and can't test
> it right now.
>
Sorry, I can't either.

> [patch snipped down to the relevant hunks]
>
> +	/*
> +	 * not-on-LRU pages are out of control. So, add to root cgroup.
> +	 * See mm/memcontrol.c for details.
> +	 */
>
> [...]
>
> -	if (PageCompound(page))
> +	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
> 		return 0;
>
The comment above says "not-on-LRU pages are out of control. So, add to
root cgroup.", but this change means that we don't charge these pages at
all.

Should it be:

	if (gfp_mask & __GFP_NOMEMCGROUP)
		mm = &init_mm;

? Or should the comment be changed?

Thanks,
Daisuke Nishimura.
* Re: cgroup: rmdir() does not complete
  From: KAMEZAWA Hiroyuki @ 2010-09-10 4:11 UTC
  To: Daisuke Nishimura; +Cc: Mark Hills, Peter Zijlstra, Balbir Singh, linux-kernel

On Fri, 10 Sep 2010 13:05:39 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> [...]
> The comment above says "not-on-LRU pages are out of control. So, add to
> root cgroup.", but this change means that we don't charge these pages at
> all.
>
> Should it be:
>
> 	if (gfp_mask & __GFP_NOMEMCGROUP)
> 		mm = &init_mm;
>
> ? Or should the comment be changed?
>
Yes... the comment is wrong.

Thanks,
-Kame

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

[changelog and Signed-off-by as in the previous version]

Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
 	remove_from_page_cache(oldpage);
 	page_cache_release(oldpage);
 
-	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+	/*
+	 * non-LRU pages are out of cgroup controls.
+	 * See mm/memcontrol.c or Documentation/cgroup/memory.txt for details.
+	 */
+	if (buf->flags & PIPE_BUF_FLAG_LRU)
+		mask |= __GFP_NOMEMCGROUP;
+
+	err = add_to_page_cache_locked(newpage, mapping, index, mask);
 	if (err) {
 		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
 		goto out_fallback_unlock;

[remaining hunks (the fs/fuse/dev.c include and mask declaration,
include/linux/gfp.h, and mm/memcontrol.c) are unchanged from the previous
version]
==
* Re: cgroup: rmdir() does not complete
  2010-09-10  2:16 ` KAMEZAWA Hiroyuki
  2010-09-10  4:05 ` Daisuke Nishimura
@ 2010-09-10  7:28 ` Mark Hills
  2010-09-10  7:33 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-10  7:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
> > The report on the spinning process (23586) is dominated by calls from
> > mem_cgroup_force_empty.
> >
> > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > the load (I assume drain_all_stock_sync has been optimised out). But I
> > don't think this is as important as what causes the spin.
>
> I noticed you use FUSE, and it seems there is a problem in FUSE vs. memcg.
> I wrote a patch (against 2.6.36, but it can be applied).
>
> Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> right now.

What makes you conclude that FUSE is in use? I do not think this is the
case. Or do you mean that it is a problem that the kernel is built with
FUSE support?

I _can_ test the patch, but I still cannot reliably reproduce the problem,
so it will be hard to conclude whether the patch works or not. Is there a
way to build a test case for this?

Thanks for your help

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: cgroup: rmdir() does not complete
  2010-09-10  7:28 ` Mark Hills
@ 2010-09-10  7:33 ` KAMEZAWA Hiroyuki
  2010-09-10  7:51 ` Mark Hills
  0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-10  7:33 UTC (permalink / raw)
  To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010 08:28:00 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
>
> > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > Mark Hills <mark@pogo.org.uk> wrote:
> > > The report on the spinning process (23586) is dominated by calls from
> > > mem_cgroup_force_empty.
> > >
> > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > > the load (I assume drain_all_stock_sync has been optimised out). But I
> > > don't think this is as important as what causes the spin.
> >
> > I noticed you use FUSE, and it seems there is a problem in FUSE vs. memcg.
> > I wrote a patch (against 2.6.36, but it can be applied).
> >
> > Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> > right now.
>
> What makes you conclude that FUSE is in use? I do not think this is the
> case. Or do you mean that it is a problem that the kernel is built with
> FUSE support?
>
You wrote:

> The test case I was running is similar to the above. With the Lustre
> filesystem the problem takes 4 hours or more to show itself. Recently I
> ran 4 threads for over 24 hours without it being seen -- I suspect some
> external factor is involved.

I think the Lustre FS is using FUSE. Am I wrong?

> I _can_ test the patch, but I still cannot reliably reproduce the problem,
> so it will be hard to conclude whether the patch works or not. Is there a
> way to build a test case for this?
>
I'm sorry, I'm not sure yet. But from your report, you have 6 pages of
charge which cannot be found by force_empty(). And I found that FUSE's
pipe copy code inserts a page into the page cache radix-tree but does not
move it onto the LRU.

So:
 - there are remaining pages which are off the LRU;
 - FUSE's "move" code does something curious: add_to_page_cache() but no
   LRU insertion;
 - you reported that you use the Lustre FS.

Hence my question. To test this myself, I would have to study FUSE enough
to write a test module... Maybe adding a printk() where I modified the
gfp_mask in fuse/dev.c could show something.

There may be some other problem as well, but this seems to be one of them.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 26+ messages in thread
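A minimal sketch of the instrumentation suggested here, assuming the patched fuse_try_move_page() from earlier in the thread; the message text and the fields printed are illustrative, not part of the posted patch:

	/* in fuse_try_move_page(), where the patch adjusts the gfp mask */
	if (buf->flags & PIPE_BUF_FLAG_LRU) {
		mask |= __GFP_NOMEMCGROUP;
		printk(KERN_DEBUG
		       "fuse_try_move_page: skipping memcg charge for page %p "
		       "(mapping %p, index %lu)\n",
		       newpage, mapping, (unsigned long)index);
	}

If this message appears during the workload, it implicates this path; if it never fires, the unreclaimable charges are coming from somewhere else.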
* Re: cgroup: rmdir() does not complete
  2010-09-10  7:33 ` KAMEZAWA Hiroyuki
@ 2010-09-10  7:51 ` Mark Hills
  0 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-09-10  7:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 10 Sep 2010 08:28:00 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
>
> > On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
> >
> > > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > > Mark Hills <mark@pogo.org.uk> wrote:
> > > > The report on the spinning process (23586) is dominated by calls from
> > > > mem_cgroup_force_empty.
> > > >
> > > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > > > the load (I assume drain_all_stock_sync has been optimised out). But I
> > > > don't think this is as important as what causes the spin.
> > >
> > > I noticed you use FUSE, and it seems there is a problem in FUSE vs. memcg.
> > > I wrote a patch (against 2.6.36, but it can be applied).
> > >
> > > Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> > > right now.
> >
> > What makes you conclude that FUSE is in use? I do not think this is the
> > case. Or do you mean that it is a problem that the kernel is built with
> > FUSE support?
> >
> You wrote:
>
> > The test case I was running is similar to the above. With the Lustre
> > filesystem the problem takes 4 hours or more to show itself. Recently I
> > ran 4 threads for over 24 hours without it being seen -- I suspect some
> > external factor is involved.
>
> I think the Lustre FS is using FUSE. Am I wrong?

Lustre does not use FUSE. But the client is a set of kernel modules, so
these could do anything.

> > I _can_ test the patch, but I still cannot reliably reproduce the problem,
> > so it will be hard to conclude whether the patch works or not. Is there a
> > way to build a test case for this?
> >
> I'm sorry, I'm not sure yet. But from your report, you have 6 pages of
> charge which cannot be found by force_empty(). And I found that FUSE's
> pipe copy code inserts a page into the page cache radix-tree but does not
> move it onto the LRU.
>
> So:
>  - there are remaining pages which are off the LRU;
>  - FUSE's "move" code does something curious: add_to_page_cache() but no
>    LRU insertion;
>  - you reported that you use the Lustre FS.
>
> Hence my question. To test this myself, I would have to study FUSE enough
> to write a test module... Maybe adding a printk() where I modified the
> gfp_mask in fuse/dev.c could show something.
>
> There may be some other problem as well, but this seems to be one of them.

Okay, it sounds like perhaps I need to investigate Lustre; I will do this
next week. But I think FUSE can be ruled out.

Thanks again

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread
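To make the "add_to_page_cache() but no LRU insertion" observation above concrete, the relevant FUSE path looks roughly like the excerpt below (paraphrased from fs/fuse/dev.c of this era; exact code may differ between versions):

	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
	/* the memcg charge is taken here, via mem_cgroup_cache_charge() */
	...
	if (!(buf->flags & PIPE_BUF_FLAG_LRU))
		lru_cache_add_file(newpage);
	/*
	 * When PIPE_BUF_FLAG_LRU is set, the page is assumed to be on an
	 * LRU already and is not re-added, so the fresh charge never passes
	 * through the LRU-add path where force_empty could find it -- which
	 * would explain the condition chosen in the patch above.
	 */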
* Re: cgroup: rmdir() does not complete
  2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
  2010-08-27  0:56 ` Daisuke Nishimura
@ 2010-08-27  1:25 ` KAMEZAWA Hiroyuki
  2010-08-30  9:25 ` Mark Hills
  1 sibling, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27  1:25 UTC (permalink / raw)
  To: Mark Hills; +Cc: linux-kernel

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> spins, others queue up behind it with the following:
>
> INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
>  ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
>  0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
>  ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> Call Trace:
>  [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
>  [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
>  [<ffffffff81108a7c>] ? path_put+0x1d/0x22
>  [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
>  [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
>  [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
>  [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
>  [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> tasks.
>
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> the rmdir. It looks like what I am seeing here and indicates that some
> cgroup subsystem is busy, indefinitely.
>
Hmm, does it really spin? Or does it sleep forever with no wake-up?

> I have not worked out how to reproduce it quickly. My only way is to
> complete a 'dd' command in the cgroup, but then the problem is so rare it
> is slow progress.
>
Please show how to reproduce it your way.
And which cgroups are mounted? The memory cgroup only?

> Documentation/cgroup.memory.txt describes how force_empty can be required
> in some cases.

Ah, maybe that text is wrong. rmdir() calls force_empty automatically.

> Does this mean that with the patch above, these cases will
> now spin on rmdir(), instead of returning -EBUSY? How can produce a
> reliable test case requiring memory.force_empty to be used, to test this?
>
Hmm, I'm not sure whether the Fedora kernel has features of its own beyond
the stock kernel. I would be glad if you could check whether this can
happen with a stock kernel, 2.6.35.

> Or is it likely to be some other cause, and how best to find it?
>
At first look, the above mutex is the one taken in do_rmdir(), not one in
kernel/cgroup.c. So those rmdir() calls don't even seem to reach the
cgroup code...
Do you run another operation on the directory while rmdir is called?

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 26+ messages in thread
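The mutex in question: in kernels of this era, do_rmdir() takes the parent directory's i_mutex before calling into the filesystem, roughly as in the abridged excerpt below (from fs/namei.c; details vary between versions):

	static long do_rmdir(int dfd, const char __user *pathname)
	{
		/* ... path lookup ... */
		mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex,
				  I_MUTEX_PARENT);
		dentry = lookup_hash(&nd);
		/* ... */
		error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
		/* ... unlock and release the path ... */
	}

So the tasks queued in the trace are waiting on the parent directory's i_mutex; whichever task holds it would be further along, inside vfs_rmdir() and hence cgroup's ->rmdir() handler, which is where an indefinite retry loop would sit.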
* Re: cgroup: rmdir() does not complete
  2010-08-27  1:25 ` KAMEZAWA Hiroyuki
@ 2010-08-30  9:25 ` Mark Hills
  0 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-08-30  9:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:

> On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
>
> > I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> > spins, others queue up behind it with the following:
> >
> > INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
> >  ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> >  0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> >  ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> > Call Trace:
> >  [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> >  [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> >  [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> >  [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> >  [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> >  [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> >  [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> >  [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
> >
> > Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> > tasks.
> >
> > Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> > the rmdir. It looks like what I am seeing here and indicates that some
> > cgroup subsystem is busy, indefinitely.
> >
> Hmm, does it really spin? Or does it sleep forever with no wake-up?

It sleeps in the D state, but enters an interruptible state periodically,
which is why my attention was drawn to that loop.

> > I have not worked out how to reproduce it quickly. My only way is to
> > complete a 'dd' command in the cgroup, but then the problem is so rare it
> > is slow progress.
> >
> Please show how to reproduce it your way.

I use a C program which creates a container and places itself in the
container, then forks a dd process. But it seems you found an easier test
case; I hope to test that soon.

> And which cgroups are mounted? The memory cgroup only?

Quite a few: memory, blkio, cpuacct, cpuset. Until I can get a more
reproducible test case (see my previous mail), I can't really reduce this.

> > Documentation/cgroup.memory.txt describes how force_empty can be required
> > in some cases.
>
> Ah, maybe that text is wrong. rmdir() calls force_empty automatically.
>
> > Does this mean that with the patch above, these cases will
> > now spin on rmdir(), instead of returning -EBUSY? How can produce a
> > reliable test case requiring memory.force_empty to be used, to test this?
> >
> Hmm, I'm not sure whether the Fedora kernel has features of its own beyond
> the stock kernel. I would be glad if you could check whether this can
> happen with a stock kernel, 2.6.35.
>
> > Or is it likely to be some other cause, and how best to find it?
> >
> At first look, the above mutex is the one taken in do_rmdir(), not one in
> kernel/cgroup.c. So those rmdir() calls don't even seem to reach the
> cgroup code...

Interesting; I checked for that, but I'm not sure how I missed it. There
is clearly a mutex lock in do_rmdir() in fs/namei.c.

> Do you run another operation on the directory while rmdir is called?

In one case I did an 'ls -l' on the filesystem which coincided with a
lock-up, but I was not able to reproduce this.

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread
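For concreteness, here is a minimal sketch of the kind of reproducer Mark describes above: create a cgroup, join it, run dd inside it, leave the group, and remove the directory. The /cgroup mount point and flat hierarchy are assumptions about the test setup, not details from the thread:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	#define CGROUP "/cgroup/soak"	/* assumed cgroup mount point */

	/* write our pid into a tasks file, moving us into that group */
	static int join(const char *tasks)
	{
		FILE *f = fopen(tasks, "w");
		if (!f)
			return -1;
		fprintf(f, "%d\n", getpid());
		return fclose(f);
	}

	int main(void)
	{
		pid_t pid;

		if (mkdir(CGROUP, 0755) == -1) {
			perror("mkdir");
			return 1;
		}

		/* place this task (and its children) in the new group */
		if (join(CGROUP "/tasks") != 0) {
			perror(CGROUP "/tasks");
			return 1;
		}

		/* generate page cache activity inside the group */
		pid = fork();
		if (pid == 0) {
			execlp("dd", "dd", "if=/dev/zero",
			       "of=/tmp/soak.out", "bs=1M", "count=64",
			       (char *)NULL);
			_exit(127);
		}
		waitpid(pid, NULL, 0);

		/* move back to the root group, leaving the group empty */
		if (join("/cgroup/tasks") != 0) {
			perror("/cgroup/tasks");
			return 1;
		}

		/* the rmdir() that occasionally never completes */
		if (rmdir(CGROUP) == -1)
			perror("rmdir");

		return 0;
	}

Run in a loop, this exercises the same create/charge/empty/rmdir cycle; per the thread, the hang is rare, so many iterations would be needed.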
end of thread, other threads:[~2010-09-10  7:51 UTC | newest]

Thread overview: 26+ messages
2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
2010-08-27  0:56 ` Daisuke Nishimura
2010-08-27  1:20 ` Balbir Singh
2010-08-27  2:35 ` KAMEZAWA Hiroyuki
2010-08-27  3:39 ` Daisuke Nishimura
2010-08-27  5:42 ` KAMEZAWA Hiroyuki
2010-08-27  6:29 ` KAMEZAWA Hiroyuki
2010-08-30  7:32 ` Balbir Singh
2010-08-30  9:13 ` Mark Hills
2010-09-01 11:10 ` Mark Hills
2010-09-01 23:42 ` KAMEZAWA Hiroyuki
2010-09-02  9:45 ` Mark Hills
2010-09-09 10:01 ` Mark Hills
2010-09-09 10:09 ` Balbir Singh
2010-09-09 11:36 ` Mark Hills
2010-09-09 11:50 ` Peter Zijlstra
2010-09-09 23:04 ` Mark Hills
2010-09-09 23:43 ` KAMEZAWA Hiroyuki
2010-09-10  2:16 ` KAMEZAWA Hiroyuki
2010-09-10  4:05 ` Daisuke Nishimura
2010-09-10  4:11 ` KAMEZAWA Hiroyuki
2010-09-10  7:28 ` Mark Hills
2010-09-10  7:33 ` KAMEZAWA Hiroyuki
2010-09-10  7:51 ` Mark Hills
2010-08-27  1:25 ` KAMEZAWA Hiroyuki
2010-08-30  9:25 ` Mark Hills