* cgroup: rmdir() does not complete
@ 2010-08-26 15:51 Mark Hills
2010-08-27 0:56 ` Daisuke Nishimura
2010-08-27 1:25 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 26+ messages in thread
From: Mark Hills @ 2010-08-26 15:51 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-kernel
I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
spins, others queue up behind it with the following:
INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
Call Trace:
[<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
[<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
[<ffffffff81108a7c>] ? path_put+0x1d/0x22
[<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
[<ffffffff81427c4f>] mutex_lock+0x31/0x4b
[<ffffffff8110bdf8>] do_rmdir+0x74/0x102
[<ffffffff8110bebd>] sys_rmdir+0x11/0x13
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
tasks.
Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
the rmdir. It looks like what I am seeing here and indicates that some
cgroup subsystem is busy, indefinitely.
I have not worked out how to reproduce it quickly. My only way is to
complete a 'dd' command in the cgroup, but then the problem is so rare it
is slow progress.
Documentation/cgroups/memory.txt describes how force_empty can be required
in some cases. Does this mean that with the patch above, these cases will
now spin on rmdir(), instead of returning -EBUSY? How can I produce a
reliable test case requiring memory.force_empty to be used, to test this?
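For reference, the manual sequence described there is roughly the following
(a sketch only; 'test' is a hypothetical group name):
==
# ask the memory controller to reclaim/uncharge everything in the group
echo 0 > /cgroup/test/memory.force_empty
rmdir /cgroup/test
==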
Or is it likely to be some other cause, and how best to find it?
Thanks
--
Mark
* Re: cgroup: rmdir() does not complete
2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
@ 2010-08-27 0:56 ` Daisuke Nishimura
2010-08-27 1:20 ` Balbir Singh
2010-08-27 2:35 ` KAMEZAWA Hiroyuki
2010-08-27 1:25 ` KAMEZAWA Hiroyuki
1 sibling, 2 replies; 26+ messages in thread
From: Daisuke Nishimura @ 2010-08-27 0:56 UTC (permalink / raw)
To: Mark Hills; +Cc: KAMEZAWA Hiroyuki, linux-kernel, Daisuke Nishimura
Hi.
On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> spins, others queue up behind it with the following:
>
> INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
> ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> Call Trace:
> [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> tasks.
>
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> the rmdir. It looks like what I am seeing here and indicates that some
> cgroup subsystem is busy, indefinitely.
>
The commit had caused a bug about rmdir, but it was fixed by the commit 88703267.
The fix was merged in 2.6.31, so it seems that you hit a new one...
> I have not worked out how to reproduce it quickly. My only way is to
> complete a 'dd' command in the cgroup, but then the problem is so rare it
> is slow progress.
>
> Documentation/cgroups/memory.txt describes how force_empty can be required
> in some cases. Does this mean that with the patch above, these cases will
> now spin on rmdir(), instead of returning -EBUSY? How can I produce a
> reliable test case requiring memory.force_empty to be used, to test this?
>
You don't need to touch "force_empty". rmdir() does what "force_empty" does.
> Or is it likely to be some other cause, and how best to find it?
>
What cgroup subsystems did you mount on the hierarchy where the directory
you first tried to rmdir() existed?
If you mounted several subsystems on the same hierarchy, can you mount them
separately to narrow down the cause ?
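For example, something like this mounts memory and cpuset on separate
hierarchies (a sketch; the mount points are hypothetical):
==
mkdir -p /cgroups/memory /cgroups/cpuset
mount -t cgroup -o memory none /cgroups/memory
mount -t cgroup -o cpuset none /cgroups/cpuset
==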
Thanks,
Daisuke Nishimura.
* Re: cgroup: rmdir() does not complete
2010-08-27 0:56 ` Daisuke Nishimura
@ 2010-08-27 1:20 ` Balbir Singh
2010-08-27 2:35 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2010-08-27 1:20 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: Mark Hills, KAMEZAWA Hiroyuki, linux-kernel
On Fri, Aug 27, 2010 at 6:26 AM, Daisuke Nishimura
<nishimura@mxp.nes.nec.co.jp> wrote:
> Hi.
>
> On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
>
>> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
>> spins, others queue up behind it with the following:
>>
>> INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
>> ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
>> 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
>> ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
>> Call Trace:
>> [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
>> [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
>> [<ffffffff81108a7c>] ? path_put+0x1d/0x22
>> [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
>> [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
>> [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
>> [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
>> [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>>
>> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
>> tasks.
>>
>> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
>> the rmdir. It looks like what I am seeing here and indicates that some
>> cgroup subsystem is busy, indefinitely.
>>
> The commit had caused a bug about rmdir, but it was fixed by the commit 88703267.
> The fix was merged in 2.6.31, so it seems that you hit a new one...
>
>> I have not worked out how to reproduce it quickly. My only way is to
>> complete a 'dd' command in the cgroup, but then the problem is so rare it
>> is slow progress.
>>
>> Documentation/cgroups/memory.txt describes how force_empty can be required
>> in some cases. Does this mean that with the patch above, these cases will
>> now spin on rmdir(), instead of returning -EBUSY? How can I produce a
>> reliable test case requiring memory.force_empty to be used, to test this?
>>
> You don't need to touch "force_empty". rmdir() does what "force_empty" does.
>
>> Or is it likely to be some other cause, and how best to find it?
>>
> What cgroup subsystems did you mount on the hierarchy where the directory
> you first tried to rmdir() existed?
> If you mounted several subsystems on the same hierarchy, can you mount them
> separately to narrow down the cause ?
>
It would also be nice to see what your mounted cgroup hierarchy (from a
filesystem perspective) looks like and what /proc/cgroups looks like when
the problem occurs.
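For example (a sketch of the commands, nothing more):
==
grep cgroup /proc/mounts   # which hierarchies are mounted, with which subsystems
cat /proc/cgroups          # subsys name, hierarchy id, num_cgroups, enabled
==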
Balbir
* Re: cgroup: rmdir() does not complete
2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
2010-08-27 0:56 ` Daisuke Nishimura
@ 2010-08-27 1:25 ` KAMEZAWA Hiroyuki
2010-08-30 9:25 ` Mark Hills
1 sibling, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27 1:25 UTC (permalink / raw)
To: Mark Hills; +Cc: linux-kernel
On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> spins, others queue up behind it with the following:
>
> INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
> ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> Call Trace:
> [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> tasks.
>
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> the rmdir. It looks like what I am seeing here and indicates that some
> cgroup subsystem is busy, indefinitely.
>
Hmm. Does it really spin? Or is it sleeping forever with no wake-up?
> I have not worked out how to reproduce it quickly. My only way is to
> complete a 'dd' command in the cgroup, but then the problem is so rare it
> is slow progress.
>
Please show how to reproduce it in your way.
And what cgroup is mounted ? memory cgroup only ?
> Documentation/cgroups/memory.txt describes how force_empty can be required
> in some cases.
Ah, maybe that's wrong text. rmdir() calls force-empty automatically.
> Does this mean that with the patch above, these cases will
> now spin on rmdir(), instead of returning -EBUSY? How can I produce a
> reliable test case requiring memory.force_empty to be used, to test this?
>
Hmm. I'm not sure whether the Fedora kernel has other (its own) features
beyond the stock kernel. I'd be glad if you can check whether it can happen
on a stock kernel, 2.6.35.
> Or is it likely to be some other cause, and how best to find it?
>
At first look, the mutex above is the mutex in do_rmdir(), not in
kernel/cgroup.c. So rmdir doesn't even seem to reach the cgroup code...
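If possible, reading the blocked task's kernel stack directly could confirm
where it sits (a sketch; /proc/<pid>/stack needs CONFIG_STACKTRACE, and the
PID is the one from your report):
==
cat /proc/27257/wchan; echo   # symbol the task is blocked in
cat /proc/27257/stack         # full kernel stack, if available
==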
Do you do another operation on the directory while rmdir is called ?
Thanks,
-Kame
* Re: cgroup: rmdir() does not complete
2010-08-27 0:56 ` Daisuke Nishimura
2010-08-27 1:20 ` Balbir Singh
@ 2010-08-27 2:35 ` KAMEZAWA Hiroyuki
2010-08-27 3:39 ` Daisuke Nishimura
1 sibling, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27 2:35 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: Mark Hills, linux-kernel
On Fri, 27 Aug 2010 09:56:39 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > Or is it likely to be some other cause, and how best to find it?
> >
> What cgroup subsystems did you mount on the hierarchy where the directory
> you first tried to rmdir() existed?
> If you mounted several subsystems on the same hierarchy, can you mount them
> separately to narrow down the cause ?
>
It seems I can reproduce the issue on mmotm-0811, too.
try this.
Here, memory cgroup is mounted at /cgroups.
==
#!/bin/bash -x
while sleep 1; do
date
mkdir /cgroups/test
echo 0 > /cgroups/test/tasks
echo 300M > /cgroups/test/memory.limit_in_bytes
cat /proc/self/cgroup
dd if=/dev/zero of=./tmpfile bs=4096 count=100000
echo 0 > /cgroups/tasks
cat /proc/self/cgroup
rmdir /cgroups/test
rm ./tmpfile
done
==
hangs at rmdir. I'm now investigating force_empty.
Thanks,
-Kame
* Re: cgroup: rmdir() does not complete
2010-08-27 2:35 ` KAMEZAWA Hiroyuki
@ 2010-08-27 3:39 ` Daisuke Nishimura
2010-08-27 5:42 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 26+ messages in thread
From: Daisuke Nishimura @ 2010-08-27 3:39 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Mark Hills, linux-kernel, balbir, Daisuke Nishimura
On Fri, 27 Aug 2010 11:35:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 27 Aug 2010 09:56:39 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > > Or is it likely to be some other cause, and how best to find it?
> > >
> > What cgroup subsystems did you mount on the hierarchy where the directory
> > you first tried to rmdir() existed?
> > If you mounted several subsystems on the same hierarchy, can you mount them
> > separately to narrow down the cause ?
> >
>
> It seems I can reproduce the issue on mmotm-0811, too.
>
> try this.
>
> Here, memory cgroup is mounted at /cgroups.
> ==
> #!/bin/bash -x
>
> while sleep 1; do
> date
> mkdir /cgroups/test
> echo 0 > /cgroups/test/tasks
> echo 300M > /cgroups/test/memory.limit_in_bytes
> cat /proc/self/cgroup
> dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> echo 0 > /cgroups/tasks
> cat /proc/self/cgroup
> rmdir /cgroups/test
> rm ./tmpfile
> done
> ==
>
> hangs at rmdir. I'm now investigating force_empty.
>
Thank you very much for your information.
Some questions.
Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
And, how long does it likely to take to cause this problem ?
I've run it on RHEL6-based kernel/ext3 for about one hour, but
I cannot reproduce it yet.
Thanks,
Daisuke Nishimura.
* Re: cgroup: rmdir() does not complete
2010-08-27 3:39 ` Daisuke Nishimura
@ 2010-08-27 5:42 ` KAMEZAWA Hiroyuki
2010-08-27 6:29 ` KAMEZAWA Hiroyuki
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27 5:42 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: Mark Hills, linux-kernel, balbir
On Fri, 27 Aug 2010 12:39:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Fri, 27 Aug 2010 11:35:06 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Fri, 27 Aug 2010 09:56:39 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> >
> > > > Or is it likely to be some other cause, and how best to find it?
> > > >
> > > What cgroup subsystems did you mount on the hierarchy where the directory
> > > you first tried to rmdir() existed?
> > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > separately to narrow down the cause ?
> > >
> >
> > It seems I can reproduce the issue on mmotm-0811, too.
> >
> > try this.
> >
> > Here, memory cgroup is mounted at /cgroups.
> > ==
> > #!/bin/bash -x
> >
> > while sleep 1; do
> > date
> > mkdir /cgroups/test
> > echo 0 > /cgroups/test/tasks
> > echo 300M > /cgroups/test/memory.limit_in_bytes
> > cat /proc/self/cgroup
> > dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > echo 0 > /cgroups/tasks
> > cat /proc/self/cgroup
> > rmdir /cgroups/test
> > rm ./tmpfile
> > done
> > ==
> >
> > hangs at rmdir. I'm now investigating force_empty.
> >
> Thank you very much for your information.
>
> Some questions.
>
> Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
on ext4.
> And how long is it likely to take to cause this problem?
very soon. 10-20 loop.
> I've run it on RHEL6-based kernel/ext3 for about one hour, but
> I cannot reproduce it yet.
>
Hmm...I'll dig more. Maybe I need to use stock kernel rather than -mm...
Thanks,
-Kame
* Re: cgroup: rmdir() does not complete
2010-08-27 5:42 ` KAMEZAWA Hiroyuki
@ 2010-08-27 6:29 ` KAMEZAWA Hiroyuki
2010-08-30 7:32 ` Balbir Singh
2010-08-30 9:13 ` Mark Hills
2010-09-01 11:10 ` Mark Hills
2 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27 6:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, Mark Hills, linux-kernel, balbir
On Fri, 27 Aug 2010 14:42:25 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I've run it on RHEL6-based kernel/ext3 for about one hour, but
> > I cannot reproduce it yet.
> >
>
> Hmm...I'll dig more. Maybe I need to use stock kernel rather than -mm...
>
>
Sorry, my test just hangs on -mm + (other patches);
no troubles on 2.6.34 and 2.6.36-rc1.
Where can I get the 2.6.33.6 (Fedora) kernel?
Thanks,
-Kame
* Re: cgroup: rmdir() does not complete
2010-08-27 6:29 ` KAMEZAWA Hiroyuki
@ 2010-08-30 7:32 ` Balbir Singh
0 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2010-08-30 7:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, Mark Hills, linux-kernel
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-08-27 15:29:58]:
> On Fri, 27 Aug 2010 14:42:25 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > > I've run it on RHEL6-based kernel/ext3 for about one hour, but
> > > I cannot reproduce it yet.
> > >
> >
> > Hmm...I'll dig more. Maybe I need to use stock kernel rather than -mm...
> >
> >
> Sorry, my test just hangs on -mm + (other patches)
> no troubles on 2.6.34 and 2.6.36-rc1.
>
> Where can I get the 2.6.33.6 (Fedora) kernel?
>
You can get the SRPM from the mirrors, one place to find it would be
http://download.fedora.redhat.com/pub/fedora/linux/updates/13/SRPMS/
--
Three Cheers,
Balbir
* Re: cgroup: rmdir() does not complete
2010-08-27 5:42 ` KAMEZAWA Hiroyuki
2010-08-27 6:29 ` KAMEZAWA Hiroyuki
@ 2010-08-30 9:13 ` Mark Hills
2010-09-01 11:10 ` Mark Hills
2 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-08-30 9:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir
On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Aug 2010 12:39:48 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > On Fri, 27 Aug 2010 11:35:06 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Fri, 27 Aug 2010 09:56:39 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > >
> > > > > Or is it likely to be some other cause, and how best to find it?
> > > > >
> > > > What cgroup subsystems did you mount on the hierarchy where the directory
> > > > you first tried to rmdir() existed?
> > > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > > separately to narrow down the cause ?
> > > >
> > >
> > > It seems I can reproduce the issue on mmotm-0811, too.
> > >
> > > try this.
> > >
> > > Here, memory cgroup is mounted at /cgroups.
> > > ==
> > > #!/bin/bash -x
> > >
> > > while sleep 1; do
> > > date
> > > mkdir /cgroups/test
> > > echo 0 > /cgroups/test/tasks
> > > echo 300M > /cgroups/test/memory.limit_in_bytes
> > > cat /proc/self/cgroup
> > > dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > > echo 0 > /cgroups/tasks
> > > cat /proc/self/cgroup
> > > rmdir /cgroups/test
> > > rm ./tmpfile
> > > done
> > > ==
> > >
> > > hangs at rmdir. I'm now investigating force_empty.
> > >
> > Thank you very much for your information.
> >
> > Some questions.
> >
> > Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
> on ext4.
>
> > And how long is it likely to take to cause this problem?
>
> very soon. 10-20 loop.
The test case I was running is similar to the above. With the Lustre
filesystem the problem takes 4 hours or more to show itself. Recently I
ran 4 threads for over 24 hours without it being seen -- I suspect some
external factor is involved.
I also tried NFS, and did not see a problem after 8 hours or so, but this
is inconclusive.
The combination of the Fedora kernel and the Lustre filesystem is not
satisfactory for tracing the bug. Until I can get a test case which is more
readily reproducible, I'm not able to reasonably think about changing
variables.
It is interesting you see the problem so readily on ext4; I will test that
soon (it is currently a holiday weekend in the UK). I hope it will give me
the test case I am looking for.
Thanks
--
Mark
* Re: cgroup: rmdir() does not complete
2010-08-27 1:25 ` KAMEZAWA Hiroyuki
@ 2010-08-30 9:25 ` Mark Hills
0 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-08-30 9:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-kernel
On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:
> On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
>
> > I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
> > spins, others queue up behind it with the following:
> >
> > INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > soaked-cgrou D ffff8800058157c0 0 27257 29411 0x00000000
> > ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> > 0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> > ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> > Call Trace:
> > [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> > [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> > [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> > [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> > [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> > [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> > [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> > [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
> >
> > Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
> > tasks.
> >
> > Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
> > the rmdir. It looks like what I am seeing here and indicates that some
> > cgroup subsystem is busy, indefinitely.
> >
>
> Hmm. Does it really spin? Or is it sleeping forever with no wake-up?
It sleeps in D state, but enters an interruptible state periodically, which
is why my attention was drawn to that loop.
> > I have not worked out how to reproduce it quickly. My only way is to
> > complete a 'dd' command in the cgroup, but then the problem is so rare it
> > is slow progress.
> >
> Please show how to reproduce it in your way.
I use a C program which creates a container and places itself in the
container, then forks a dd process.
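In shell terms it is roughly equivalent to this (a sketch, not the actual
program; paths are illustrative):
==
mkdir /cgroup/test-$$
echo $$ > /cgroup/test-$$/tasks    # move this shell into the new group
dd if=/dev/zero of=/tmp/tmpfile bs=1M count=400
echo $$ > /cgroup/tasks            # move back to the root group
rmdir /cgroup/test-$$              # the call that occasionally never completes
==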
But it seems you found an easier test case; I hope to test that soon.
> And what cgroup is mounted ? memory cgroup only ?
Quite a few: memory, blkio, cpuacct, cpuset.
Until I can get a more reproducible test case (see my previous mail), I
can't really reduce this.
> > Documentation/cgroups/memory.txt describes how force_empty can be required
> > in some cases.
>
> Ah, maybe that's wrong text. rmdir() calls force-empty automatically.
>
> > Does this mean that with the patch above, these cases will
> > now spin on rmdir(), instead of returning -EBUSY? How can I produce a
> > reliable test case requiring memory.force_empty to be used, to test this?
> >
>
> Hmm. I'm not sure whether the Fedora kernel has other (its own) features
> beyond the stock kernel. I'd be glad if you can check whether it can happen
> on a stock kernel, 2.6.35.
>
> > Or is it likely to be some other cause, and how best to find it?
> >
>
> At first look, the mutex above is the mutex in do_rmdir(), not in
> kernel/cgroup.c. So rmdir doesn't even seem to reach the cgroup code...
Interesting; I checked for that but am not sure how I missed it. There is
clearly a mutex lock in do_rmdir() in fs/namei.c.
> Do you do another operation on the directory while rmdir is called ?
In one case I did an 'ls -l' on the filesystem which coincided with a
lockup, but I was not able to reproduce this.
--
Mark
* Re: cgroup: rmdir() does not complete
2010-08-27 5:42 ` KAMEZAWA Hiroyuki
2010-08-27 6:29 ` KAMEZAWA Hiroyuki
2010-08-30 9:13 ` Mark Hills
@ 2010-09-01 11:10 ` Mark Hills
2010-09-01 23:42 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-01 11:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir
On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Aug 2010 12:39:48 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > On Fri, 27 Aug 2010 11:35:06 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Fri, 27 Aug 2010 09:56:39 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > >
> > > > > Or is it likely to be some other cause, and how best to find it?
> > > > >
> > > > What cgroup subsystems did you mount on the hierarchy where the directory
> > > > you first tried to rmdir() existed?
> > > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > > separately to narrow down the cause ?
> > > >
> > >
> > > It seems I can reproduce the issue on mmotm-0811, too.
> > >
> > > try this.
> > >
> > > Here, memory cgroup is mounted at /cgroups.
> > > ==
> > > #!/bin/bash -x
> > >
> > > while sleep 1; do
> > > date
> > > mkdir /cgroups/test
> > > echo 0 > /cgroups/test/tasks
> > > echo 300M > /cgroups/test/memory.limit_in_bytes
> > > cat /proc/self/cgroup
> > > dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > > echo 0 > /cgroups/tasks
> > > cat /proc/self/cgroup
> > > rmdir /cgroups/test
> > > rm ./tmpfile
> > > done
> > > ==
> > >
> > > hangs at rmdir. I'm now investigating force_empty.
> > >
> > Thank you very much for your information.
> >
> > Some questions.
> >
> > Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
> on ext4.
>
> > And how long is it likely to take to cause this problem?
>
> very soon. 10-20 loop.
I repeated the test above, but did not see a problem after many hundreds
of loops.
My test was with the same kernel from my original bug report (Fedora
2.6.33.6-147), using memory cgroup only and ext4 filesystem.
So it is possible we are experiencing different bugs with similar
symptoms.
--
Mark
* Re: cgroup: rmdir() does not complete
2010-09-01 11:10 ` Mark Hills
@ 2010-09-01 23:42 ` KAMEZAWA Hiroyuki
2010-09-02 9:45 ` Mark Hills
2010-09-09 10:01 ` Mark Hills
0 siblings, 2 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-01 23:42 UTC (permalink / raw)
To: Mark Hills; +Cc: Daisuke Nishimura, linux-kernel, balbir
On Wed, 1 Sep 2010 12:10:23 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:
>
> > On Fri, 27 Aug 2010 12:39:48 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> >
> > > On Fri, 27 Aug 2010 11:35:06 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > > On Fri, 27 Aug 2010 09:56:39 +0900
> > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > >
> > > > > > Or is it likely to be some other cause, and how best to find it?
> > > > > >
> > > > > What cgroup subsystems did you mount on the hierarchy where the directory
> > > > > you first tried to rmdir() existed?
> > > > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > > > separately to narrow down the cause ?
> > > > >
> > > >
> > > > It seems I can reproduce the issue on mmotm-0811, too.
> > > >
> > > > try this.
> > > >
> > > > Here, memory cgroup is mounted at /cgroups.
> > > > ==
> > > > #!/bin/bash -x
> > > >
> > > > while sleep 1; do
> > > > date
> > > > mkdir /cgroups/test
> > > > echo 0 > /cgroups/test/tasks
> > > > echo 300M > /cgroups/test/memory.limit_in_bytes
> > > > cat /proc/self/cgroup
> > > > dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > > > echo 0 > /cgroups/tasks
> > > > cat /proc/self/cgroup
> > > > rmdir /cgroups/test
> > > > rm ./tmpfile
> > > > done
> > > > ==
> > > >
> > > > hangs at rmdir. I'm now investigating force_empty.
> > > >
> > > Thank you very much for your information.
> > >
> > > Some questions.
> > >
> > > Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
> > on ext4.
> >
> > > And how long is it likely to take to cause this problem?
> >
> > very soon. 10-20 loop.
>
> I repeated the test above, but did not see a problem after many hundreds
> of loops.
>
> My test was with the same kernel from my original bug report (Fedora
> 2.6.33.6-147), using memory cgroup only and ext4 filesystem.
>
> So it is possible we are experiencing different bugs with similar
> symptoms.
>
Thank you for confirming.
But hmm... it's curious who holds the mutex and what happens.
-Kame
* Re: cgroup: rmdir() does not complete
2010-09-01 23:42 ` KAMEZAWA Hiroyuki
@ 2010-09-02 9:45 ` Mark Hills
2010-09-09 10:01 ` Mark Hills
1 sibling, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-09-02 9:45 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir
On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:
> On Wed, 1 Sep 2010 12:10:23 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
[...]
> > I repeated the test above, but did not see a problem after many hundreds
> > of loops.
> >
> > My test was with the same kernel from my original bug report (Fedora
> > 2.6.33.6-147), using memory cgroup only and ext4 filesystem.
> >
> > So it is possible we are experiencing different bugs with similar
> > symptoms.
> >
>
> Thank you for confirming.
> But hmm... it's curious who holds the mutex and what happens.
Refer to my original email, where I was running multiple tests at once.
This backtrace is from the tests which queue up:
Call Trace:
[<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
[<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
[<ffffffff81108a7c>] ? path_put+0x1d/0x22
[<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
[<ffffffff81427c4f>] mutex_lock+0x31/0x4b
[<ffffffff8110bdf8>] do_rmdir+0x74/0x102
[<ffffffff8110bebd>] sys_rmdir+0x11/0x13
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
The one which spins has already managed to claim the mutex lock on the
/cgroup directory, and no call trace is shown for this. Is there a usable
way to force a similar call trace for the spinning process?
Unfortunately I have not been able to reproduce the problem for some days
now, so I think some network factor is able to influence this.
--
Mark
* Re: cgroup: rmdir() does not complete
2010-09-01 23:42 ` KAMEZAWA Hiroyuki
2010-09-02 9:45 ` Mark Hills
@ 2010-09-09 10:01 ` Mark Hills
2010-09-09 10:09 ` Balbir Singh
1 sibling, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-09 10:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir
On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:
[...]
> But hmm... it's curious who holds the mutex and what happens.
I have a system showing the failure case (but still do not have a way to
reliably repeat it).
Here are the two processes:
23586 pts/0 RL+ 5059:18 /net/homes/mhills/tmp/soaked-cgroup
23685 pts/6 DL+ 0:00 /net/homes/mhills/tmp/soaked-cgroup
23586 spends almost all of its time in 'RL+' status; occasionally it is
seen in 'DL+' status.
From my analysis before, both are blocked on rmdir(), but one is spinning,
holding the lock on /cgroup, and the other is waiting for the lock. If
I strace 23586 then the rmdir() fails with EINTR.
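That is, roughly:
==
strace -p 23586   # attaching delivers a signal; the pending rmdir() returns EINTR
==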
How best to capture information which might show why the process spins?
--
Mark
* Re: cgroup: rmdir() does not complete
2010-09-09 10:01 ` Mark Hills
@ 2010-09-09 10:09 ` Balbir Singh
2010-09-09 11:36 ` Mark Hills
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2010-09-09 10:09 UTC (permalink / raw)
To: Mark Hills; +Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel
* Mark Hills <mark@pogo.org.uk> [2010-09-09 11:01:45]:
> On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:
>
> [...]
> > But hmm... it's curious who holds the mutex and what happens.
>
> I have a system showing the failure case (but still do not have a way to
> reliably repeat it)
>
> Here are the two processes:
>
> 23586 pts/0 RL+ 5059:18 /net/homes/mhills/tmp/soaked-cgroup
> 23685 pts/6 DL+ 0:00 /net/homes/mhills/tmp/soaked-cgroup
>
> 23586 spends almost all of its time in 'RL+' status, occasionally it is
> seen in 'DL+' status.
>
> From my analysis before, both are blocked on rmdir(), but one is spinning,
> holding the lock on the /cgroup, and the other is waiting for the lock. If
> I strace 23586 then the rmdir() fails with EINTR.
>
Any chance you can compile with debug cgroup subsystem and get
information from there?
--
Three Cheers,
Balbir
* Re: cgroup: rmdir() does not complete
2010-09-09 10:09 ` Balbir Singh
@ 2010-09-09 11:36 ` Mark Hills
2010-09-09 11:50 ` Peter Zijlstra
0 siblings, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-09 11:36 UTC (permalink / raw)
To: Balbir Singh; +Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel
On Thu, 9 Sep 2010, Balbir Singh wrote:
> * Mark Hills <mark@pogo.org.uk> [2010-09-09 11:01:45]:
>
> > On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:
> >
> > [...]
> > > But hmm... it's curious who holds the mutex and what happens.
> >
> > I have a system showing the failure case (but still do not have a way to
> > reliably repeat it)
> >
> > Here are the two processes:
> >
> > 23586 pts/0 RL+ 5059:18 /net/homes/mhills/tmp/soaked-cgroup
> > 23685 pts/6 DL+ 0:00 /net/homes/mhills/tmp/soaked-cgroup
> >
> > 23586 spends almost all of its time in 'RL+' status, occasionally it is
> > seen in 'DL+' status.
> >
> > From my analysis before, both are blocked on rmdir(), but one is spinning,
> > holding the lock on the /cgroup, and the other is waiting for the lock. If
> > I strace 23586 then the rmdir() fails with EINTR.
> >
>
> Any chance you can compile with debug cgroup subsystem and get
> information from there?
I can; I'd like to experiment with a custom kernel next.
I am still finding the problem incredibly hard to reproduce, so I'd like
to observe as much data as possible from the current case before
rebooting. If I could capture some kind of stack trace in the kernel for
the running process, that would be great; any suggestions appreciated.
Thanks
--
Mark
* Re: cgroup: rmdir() does not complete
2010-09-09 11:36 ` Mark Hills
@ 2010-09-09 11:50 ` Peter Zijlstra
2010-09-09 23:04 ` Mark Hills
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2010-09-09 11:50 UTC (permalink / raw)
To: Mark Hills
Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel
On Thu, 2010-09-09 at 12:36 +0100, Mark Hills wrote:
> I am still finding the problem incredibly hard to reproduce, so I'd like
> to observe as much data as possible from the current case before
> rebooting. If I could capture some kind of stack trace in the kernel for
> the running process that would be great, any suggestions appreciated.
echo l > /proc/sysrq-trigger
another thing you can do is run something like: perf record -gp $pid
which will give you a profile of that task.
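Concretely, something like this (a sketch; the PID is the one from the
earlier mails):
==
echo l > /proc/sysrq-trigger          # dump backtraces of tasks running on each CPU
dmesg | tail -n 100                   # read the resulting traces
perf record -g -p 23586 -- sleep 10   # sample the spinning task for ~10 seconds
perf report                           # browse the recorded call graph
==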
* Re: cgroup: rmdir() does not complete
2010-09-09 11:50 ` Peter Zijlstra
@ 2010-09-09 23:04 ` Mark Hills
2010-09-09 23:43 ` KAMEZAWA Hiroyuki
2010-09-10 2:16 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 26+ messages in thread
From: Mark Hills @ 2010-09-09 23:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel
On Thu, 9 Sep 2010, Peter Zijlstra wrote:
> On Thu, 2010-09-09 at 12:36 +0100, Mark Hills wrote:
>
> > I am still finding the problem incredibly hard to reproduce, so I'd like
> > to observe as much data as possible from the current case before
> > rebooting. If I could capture some kind of stack trace in the kernel for
> > the running process that would be great, any suggestions appreciated.
>
> echo l > /proc/sysrq-trigger
Despite running this many times, I never 'catch' the process on a CPU,
despite it using 70% in top. But...
> another thing you can do is run something like: perf record -gp $pid
> which will give you a profile of that task.
This is very useful, thanks.
The report on the spinning process (23586) is dominated by calls from
mem_cgroup_force_empty.
It seems to show lru_add_drain_all and drain_all_stock_sync are causing
the load (I assume drain_all_stock_sync has been optimised out). But I
don't think this is as important as what causes the spin.
There are no tasks in the cgroup, but memory usage is non-zero and
constant. It seems mem_cgroup_force_empty is unable to empty the cgroup in
this case.
# cat /cgroup/soaked-23586/tasks
# cat /cgroup/soaked-23586/memory.usage_in_bytes
24576
# cat /cgroup/soaked-23586/memsw.usage_in_bytes
<hangs>
Here are the first few entries from the perf output, I can provide the
rest if needed, but all result from mem_cgroup_force_empty.
8.13% :23586 [kernel] [k] _raw_spin_lock_irqsave
|
--- _raw_spin_lock_irqsave
|
|--45.14%-- probe_workqueue_insertion
| insert_work
| |
| |--99.09%-- __queue_work
| | queue_work_on
| | schedule_work_on
| | schedule_on_each_cpu
| | |
| | |--50.59%-- lru_add_drain_all
| | | mem_cgroup_force_empty
| | | mem_cgroup_pre_destroy
| | | cgroup_rmdir
| | | vfs_rmdir
| | | do_rmdir
| | | sys_rmdir
| | | system_call_fastpath
| | | 0x3f504d27d7
| | | 0x405687
| | | 0x406ef0
| | | 0x402f31
| | | 0x3f5041eb1d
| | |
| | --49.41%-- mem_cgroup_force_empty
| | mem_cgroup_pre_destroy
| | cgroup_rmdir
| | vfs_rmdir
| | do_rmdir
| | sys_rmdir
| | system_call_fastpath
| | 0x3f504d27d7
| | 0x405687
| | 0x406ef0
| | 0x402f31
| | 0x3f5041eb1d
| --0.91%-- [...]
|
|--22.92%-- mem_cgroup_force_empty
| mem_cgroup_pre_destroy
| cgroup_rmdir
| vfs_rmdir
| do_rmdir
| sys_rmdir
| system_call_fastpath
| 0x3f504d27d7
| 0x405687
| 0x406ef0
| 0x402f31
| 0x3f5041eb1d
|
|--8.17%-- __queue_work
| queue_work_on
| schedule_work_on
| schedule_on_each_cpu
| |
| |--52.09%-- lru_add_drain_all
| | mem_cgroup_force_empty
| | mem_cgroup_pre_destroy
| | cgroup_rmdir
| | vfs_rmdir
| | do_rmdir
| | sys_rmdir
| | system_call_fastpath
| | 0x3f504d27d7
| | 0x405687
| | 0x406ef0
| | 0x402f31
| | 0x3f5041eb1d
| |
| --47.91%-- mem_cgroup_force_empty
| mem_cgroup_pre_destroy
| cgroup_rmdir
| vfs_rmdir
| do_rmdir
| sys_rmdir
| system_call_fastpath
| 0x3f504d27d7
| 0x405687
| 0x406ef0
| 0x402f31
| 0x3f5041eb1d
|
|--7.94%-- __wake_up
| |
| |--99.71%-- insert_work
| | |
| | |--97.70%-- __queue_work
| | | queue_work_on
| | | schedule_work_on
| | | schedule_on_each_cpu
| | | |
| | | |--50.59%-- mem_cgroup_force_empty
| | | | mem_cgroup_pre_destroy
| | | | cgroup_rmdir
| | | | vfs_rmdir
| | | | do_rmdir
| | | | sys_rmdir
| | | | system_call_fastpath
| | | | 0x3f504d27d7
| | | | 0x405687
| | | | 0x406ef0
| | | | 0x402f31
| | | | 0x3f5041eb1d
| | | |
| | | --49.41%-- lru_add_drain_all
| | | mem_cgroup_force_empty
| | | mem_cgroup_pre_destroy
| | | cgroup_rmdir
| | | vfs_rmdir
| | | do_rmdir
| | | sys_rmdir
| | | system_call_fastpath
| | | 0x3f504d27d7
| | | 0x405687
| | | 0x406ef0
| | | 0x402f31
| | | 0x3f5041eb1d
| | --2.30%-- [...]
| --0.29%-- [...]
|
|--4.35%-- mem_cgroup_pre_destroy
| cgroup_rmdir
| vfs_rmdir
| do_rmdir
| sys_rmdir
| system_call_fastpath
| 0x3f504d27d7
| 0x405687
| 0x406ef0
| 0x402f31
| 0x3f5041eb1d
--11.47%-- [...]
7.25% :23586 [kernel] [k] sched_clock_cpu
|
--- sched_clock_cpu
|
|--97.11%-- update_rq_clock
| |
| |--98.89%-- try_to_wake_up
| | default_wake_function
| | autoremove_wake_function
| | __wake_up_common
| | __wake_up
| | insert_work
| | __queue_work
| | queue_work_on
| | schedule_work_on
| | schedule_on_each_cpu
| | |
| | |--50.69%-- lru_add_drain_all
| | | mem_cgroup_force_empty
| | | mem_cgroup_pre_destroy
| | | cgroup_rmdir
| | | vfs_rmdir
| | | do_rmdir
| | | sys_rmdir
| | | system_call_fastpath
| | | 0x3f504d27d7
| | | 0x405687
| | | 0x406ef0
| | | 0x402f31
| | | 0x3f5041eb1d
| | |
| | --49.31%-- mem_cgroup_force_empty
| | mem_cgroup_pre_destroy
| | cgroup_rmdir
| | vfs_rmdir
| | do_rmdir
| | sys_rmdir
| | system_call_fastpath
| | 0x3f504d27d7
| | 0x405687
| | 0x406ef0
| | 0x402f31
| | 0x3f5041eb1d
| --1.11%-- [...]
--2.89%-- [...]
5.54% :23586 [kernel] [k] try_to_wake_up
|
--- try_to_wake_up
|
|--99.13%-- default_wake_function
| autoremove_wake_function
| __wake_up_common
| __wake_up
| insert_work
| __queue_work
| queue_work_on
| schedule_work_on
| schedule_on_each_cpu
| |
| |--52.03%-- lru_add_drain_all
| | mem_cgroup_force_empty
| | mem_cgroup_pre_destroy
| | cgroup_rmdir
| | vfs_rmdir
| | do_rmdir
| | sys_rmdir
| | system_call_fastpath
| | 0x3f504d27d7
| | 0x405687
| | 0x406ef0
| | 0x402f31
| | 0x3f5041eb1d
| |
| --47.97%-- mem_cgroup_force_empty
| mem_cgroup_pre_destroy
| cgroup_rmdir
| vfs_rmdir
| do_rmdir
| sys_rmdir
| system_call_fastpath
| 0x3f504d27d7
| 0x405687
| 0x406ef0
| 0x402f31
| 0x3f5041eb1d
--0.87%-- [...]
--
Mark
* Re: cgroup: rmdir() does not complete
2010-09-09 23:04 ` Mark Hills
@ 2010-09-09 23:43 ` KAMEZAWA Hiroyuki
2010-09-10 2:16 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 23:43 UTC (permalink / raw)
To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel
On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> On Thu, 9 Sep 2010, Peter Zijlstra wrote:
>
> > On Thu, 2010-09-09 at 12:36 +0100, Mark Hills wrote:
> >
> > > I am still finding the problem incredibly hard to reproduce, so I'd like
> > > to observe as much data as possible from the current case before
> > > rebooting. If I could capture some kind of stack trace in the kernel for
> > > the running process that would be great, any suggestions appreciated.
> >
> > echo l > /proc/sysrq-trigger
>
> Despite running this many times, I never 'catch' the process on a CPU,
> despite it using 70% in top. But...
>
> > another thing you can do is run something like: perf record -gp $pid
> > which will give you a profile of that task.
>
> This is very useful, thanks.
>
> The report on the spinning process (23586) is dominated by calls from
> mem_cgroup_force_empty.
>
> It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> the load (I assume drain_all_stock_sync has been optimised out). But I
> don't think this is as important as what causes the spin.
>
> There are no tasks in the cgroup, but memory usage is non-zero and
> constant. It seems mem_cgroup_force_empty is unable to empty the cgroup in
> this case.
>
> # cat /cgroup/soaked-23586/tasks
> # cat /cgroup/soaked-23586/memory.usage_in_bytes
> 24576
> # cat /cgroup/soaked-23586/memsw.usage_in_bytes
> <hangs>
>
I think this "cat" hang is because of vfs's lock.
Hmm, then, there are pages on LRU which cannot be moved or there is
leak of account.
BTW, mem_cgroup's rmdir is desgined to be able to receive SIGINT etc...
Can't you stop rmdir by Ctrl-C or some ?
rmdir -> hang -> Ctrl-C (or some) -> cat .../memory.stat
can work ? And do you still use Fedora's kernel ?
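I.e. something like this (using the group name from your mail):
==
rmdir /cgroup/soaked-23586             # hangs; interrupt with Ctrl-C
cat /cgroup/soaked-23586/memory.stat   # then check what is still charged
==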
Thanks,
-Kame
> Here are the first few entries from the perf output, I can provide the
> rest if needed, but all result from mem_cgroup_force_empty.
>
> 8.13% :23586 [kernel] [k] _raw_spin_lock_irqsave
> |
> --- _raw_spin_lock_irqsave
> |
> |--45.14%-- probe_workqueue_insertion
> | insert_work
> | |
> | |--99.09%-- __queue_work
> | | queue_work_on
> | | schedule_work_on
> | | schedule_on_each_cpu
> | | |
> | | |--50.59%-- lru_add_drain_all
> | | | mem_cgroup_force_empty
> | | | mem_cgroup_pre_destroy
> | | | cgroup_rmdir
> | | | vfs_rmdir
> | | | do_rmdir
> | | | sys_rmdir
> | | | system_call_fastpath
> | | | 0x3f504d27d7
> | | | 0x405687
> | | | 0x406ef0
> | | | 0x402f31
> | | | 0x3f5041eb1d
> | | |
> | | --49.41%-- mem_cgroup_force_empty
> | | mem_cgroup_pre_destroy
> | | cgroup_rmdir
> | | vfs_rmdir
> | | do_rmdir
> | | sys_rmdir
> | | system_call_fastpath
> | | 0x3f504d27d7
> | | 0x405687
> | | 0x406ef0
> | | 0x402f31
> | | 0x3f5041eb1d
> | --0.91%-- [...]
> |
> |--22.92%-- mem_cgroup_force_empty
> | mem_cgroup_pre_destroy
> | cgroup_rmdir
> | vfs_rmdir
> | do_rmdir
> | sys_rmdir
> | system_call_fastpath
> | 0x3f504d27d7
> | 0x405687
> | 0x406ef0
> | 0x402f31
> | 0x3f5041eb1d
> |
> |--8.17%-- __queue_work
> | queue_work_on
> | schedule_work_on
> | schedule_on_each_cpu
> | |
> | |--52.09%-- lru_add_drain_all
> | | mem_cgroup_force_empty
> | | mem_cgroup_pre_destroy
> | | cgroup_rmdir
> | | vfs_rmdir
> | | do_rmdir
> | | sys_rmdir
> | | system_call_fastpath
> | | 0x3f504d27d7
> | | 0x405687
> | | 0x406ef0
> | | 0x402f31
> | | 0x3f5041eb1d
> | |
> | --47.91%-- mem_cgroup_force_empty
> | mem_cgroup_pre_destroy
> | cgroup_rmdir
> | vfs_rmdir
> | do_rmdir
> | sys_rmdir
> | system_call_fastpath
> | 0x3f504d27d7
> | 0x405687
> | 0x406ef0
> | 0x402f31
> | 0x3f5041eb1d
> |
> |--7.94%-- __wake_up
> | |
> | |--99.71%-- insert_work
> | | |
> | | |--97.70%-- __queue_work
> | | | queue_work_on
> | | | schedule_work_on
> | | | schedule_on_each_cpu
> | | | |
> | | | |--50.59%-- mem_cgroup_force_empty
> | | | | mem_cgroup_pre_destroy
> | | | | cgroup_rmdir
> | | | | vfs_rmdir
> | | | | do_rmdir
> | | | | sys_rmdir
> | | | | system_call_fastpath
> | | | | 0x3f504d27d7
> | | | | 0x405687
> | | | | 0x406ef0
> | | | | 0x402f31
> | | | | 0x3f5041eb1d
> | | | |
> | | | --49.41%-- lru_add_drain_all
> | | | mem_cgroup_force_empty
> | | | mem_cgroup_pre_destroy
> | | | cgroup_rmdir
> | | | vfs_rmdir
> | | | do_rmdir
> | | | sys_rmdir
> | | | system_call_fastpath
> | | | 0x3f504d27d7
> | | | 0x405687
> | | | 0x406ef0
> | | | 0x402f31
> | | | 0x3f5041eb1d
> | | --2.30%-- [...]
> | --0.29%-- [...]
> |
> |--4.35%-- mem_cgroup_pre_destroy
> | cgroup_rmdir
> | vfs_rmdir
> | do_rmdir
> | sys_rmdir
> | system_call_fastpath
> | 0x3f504d27d7
> | 0x405687
> | 0x406ef0
> | 0x402f31
> | 0x3f5041eb1d
> --11.47%-- [...]
>
> 7.25% :23586 [kernel] [k] sched_clock_cpu
> |
> --- sched_clock_cpu
> |
> |--97.11%-- update_rq_clock
> | |
> | |--98.89%-- try_to_wake_up
> | | default_wake_function
> | | autoremove_wake_function
> | | __wake_up_common
> | | __wake_up
> | | insert_work
> | | __queue_work
> | | queue_work_on
> | | schedule_work_on
> | | schedule_on_each_cpu
> | | |
> | | |--50.69%-- lru_add_drain_all
> | | | mem_cgroup_force_empty
> | | | mem_cgroup_pre_destroy
> | | | cgroup_rmdir
> | | | vfs_rmdir
> | | | do_rmdir
> | | | sys_rmdir
> | | | system_call_fastpath
> | | | 0x3f504d27d7
> | | | 0x405687
> | | | 0x406ef0
> | | | 0x402f31
> | | | 0x3f5041eb1d
> | | |
> | | --49.31%-- mem_cgroup_force_empty
> | | mem_cgroup_pre_destroy
> | | cgroup_rmdir
> | | vfs_rmdir
> | | do_rmdir
> | | sys_rmdir
> | | system_call_fastpath
> | | 0x3f504d27d7
> | | 0x405687
> | | 0x406ef0
> | | 0x402f31
> | | 0x3f5041eb1d
> | --1.11%-- [...]
> --2.89%-- [...]
>
> 5.54% :23586 [kernel] [k] try_to_wake_up
> |
> --- try_to_wake_up
> |
> |--99.13%-- default_wake_function
> | autoremove_wake_function
> | __wake_up_common
> | __wake_up
> | insert_work
> | __queue_work
> | queue_work_on
> | schedule_work_on
> | schedule_on_each_cpu
> | |
> | |--52.03%-- lru_add_drain_all
> | | mem_cgroup_force_empty
> | | mem_cgroup_pre_destroy
> | | cgroup_rmdir
> | | vfs_rmdir
> | | do_rmdir
> | | sys_rmdir
> | | system_call_fastpath
> | | 0x3f504d27d7
> | | 0x405687
> | | 0x406ef0
> | | 0x402f31
> | | 0x3f5041eb1d
> | |
> | --47.97%-- mem_cgroup_force_empty
> | mem_cgroup_pre_destroy
> | cgroup_rmdir
> | vfs_rmdir
> | do_rmdir
> | sys_rmdir
> | system_call_fastpath
> | 0x3f504d27d7
> | 0x405687
> | 0x406ef0
> | 0x402f31
> | 0x3f5041eb1d
> --0.87%-- [...]
>
> --
> Mark
>
* Re: cgroup: rmdir() does not complete
2010-09-09 23:04 ` Mark Hills
2010-09-09 23:43 ` KAMEZAWA Hiroyuki
@ 2010-09-10 2:16 ` KAMEZAWA Hiroyuki
2010-09-10 4:05 ` Daisuke Nishimura
2010-09-10 7:28 ` Mark Hills
1 sibling, 2 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-10 2:16 UTC (permalink / raw)
To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel
On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> The report on the spinning process (23586) is dominated by calls from
> mem_cgroup_force_empty.
>
> It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> the load (I assume drain_all_stock_sync has been optimised out). But I
> don't think this is as important as what causes the spin.
>
I noticed you use FUSE, and it seems there is a problem in FUSE vs. memcg.
I wrote a patch (against 2.6.36, but it can be applied).
Could you try this? I'm sorry, I don't use a FUSE system and can't test it
right now.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
memory cgroup catches all pages which are added to the radix-tree and
assumes the pages will be added to the LRU somewhere.
But there are pages which are on the radix-tree but not on the LRU. Then
force_empty cannot find them and cannot finish ->pre_destroy() and the
rmdir operation.
This patch adds __GFP_NOMEMCGROUP and avoids registering unnecessary,
out-of-control pages to the memory cgroup.
Note: This gfp flag can be used for shmem handling, which now uses
complicated heuristics.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/fuse/dev.c | 11 ++++++++++-
include/linux/gfp.h | 7 +++++++
mm/memcontrol.c | 2 +-
3 files changed, 18 insertions(+), 2 deletions(-)
Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/swap.h>
#include <linux/splice.h>
+#include <linux/memcontrol.h>
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
struct pipe_buffer *buf = cs->pipebufs;
struct address_space *mapping;
pgoff_t index;
+ gfp_t mask = GFP_KERNEL;
unlock_request(cs->fc, cs->req);
fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
remove_from_page_cache(oldpage);
page_cache_release(oldpage);
- err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+ /*
+ * not-on-LRU pages are out of control. So, add to root cgroup.
+ * See mm/memcontrol.c for details.
+ */
+ if (buf->flags & PIPE_BUF_FLAG_LRU)
+ mask |= __GFP_NOMEMCGROUP;
+
+ err = add_to_page_cache_locked(newpage, mapping, index, mask);
if (err) {
printk(KERN_WARNING "fuse_try_move_page: failed to add page");
goto out_fallback_unlock;
Index: linux-2.6.36-rc3/include/linux/gfp.h
===================================================================
--- linux-2.6.36-rc3.orig/include/linux/gfp.h
+++ linux-2.6.36-rc3/include/linux/gfp.h
@@ -60,6 +60,13 @@ struct vm_area_struct;
#define __GFP_NOTRACK ((__force gfp_t)0)
#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#define __GFP_NOMEMCGROUP ((__force gfp_t)0x400000u)
+ /* Don't track by memory cgroup */
+#else
+#define __GFP_NOMEMCGROUP ((__force gfp_t)0)
+#endif
+
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
Index: linux-2.6.36-rc3/mm/memcontrol.c
===================================================================
--- linux-2.6.36-rc3.orig/mm/memcontrol.c
+++ linux-2.6.36-rc3/mm/memcontrol.c
@@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page
if (mem_cgroup_disabled())
return 0;
- if (PageCompound(page))
+ if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
return 0;
/*
* Corner case handling. This is called from add_to_page_cache()
* Re: cgroup: rmdir() does not complete
2010-09-10 2:16 ` KAMEZAWA Hiroyuki
@ 2010-09-10 4:05 ` Daisuke Nishimura
2010-09-10 4:11 ` KAMEZAWA Hiroyuki
2010-09-10 7:28 ` Mark Hills
1 sibling, 1 reply; 26+ messages in thread
From: Daisuke Nishimura @ 2010-09-10 4:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mark Hills, Peter Zijlstra, Balbir Singh, linux-kernel,
Daisuke Nishimura
On Fri, 10 Sep 2010 11:16:46 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
> > The report on the spinning process (23586) is dominated by calls from
> > mem_cgroup_force_empty.
> >
> > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > the load (I assume drain_all_stock_sync has been optimised out). But I
> > don't think this is as important as what causes the spin.
> >
>
> I noticed you use FUSE, and it seems there is a problem in FUSE vs. memcg.
> I wrote a patch (against 2.6.36, but it can be applied).
>
Nice catch!
> Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> right now.
>
Sorry, I can't either.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> memory cgroup catches all pages which are added to the radix-tree and
> assumes the pages will be added to the LRU somewhere.
> But there are pages which are on the radix-tree but not on the LRU. Then
> force_empty cannot find them and cannot finish ->pre_destroy() and the
> rmdir operation.
>
> This patch adds __GFP_NOMEMCGROUP and avoids registering unnecessary,
> out-of-control pages to the memory cgroup.
>
> Note: This gfp flag can be used for shmem handling, which now uses
> complicated heuristics.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> fs/fuse/dev.c | 11 ++++++++++-
> include/linux/gfp.h | 7 +++++++
> mm/memcontrol.c | 2 +-
> 3 files changed, 18 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.36-rc3/fs/fuse/dev.c
> ===================================================================
> --- linux-2.6.36-rc3.orig/fs/fuse/dev.c
> +++ linux-2.6.36-rc3/fs/fuse/dev.c
> @@ -19,6 +19,7 @@
> #include <linux/pipe_fs_i.h>
> #include <linux/swap.h>
> #include <linux/splice.h>
> +#include <linux/memcontrol.h>
>
> MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> MODULE_ALIAS("devname:fuse");
> @@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
> struct pipe_buffer *buf = cs->pipebufs;
> struct address_space *mapping;
> pgoff_t index;
> + gfp_t mask = GFP_KERNEL;
>
> unlock_request(cs->fc, cs->req);
> fuse_copy_finish(cs);
> @@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
> remove_from_page_cache(oldpage);
> page_cache_release(oldpage);
>
> - err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
> + /*
> + * not-on-LRU pages are out of control. So, add to root cgroup.
> + * See mm/memcontrol.c for details.
> + */
> + if (buf->flags & PIPE_BUF_FLAG_LRU)
> + mask |= __GFP_NOMEMCGROUP;
> +
> + err = add_to_page_cache_locked(newpage, mapping, index, mask);
> if (err) {
> printk(KERN_WARNING "fuse_try_move_page: failed to add page");
> goto out_fallback_unlock;
> Index: linux-2.6.36-rc3/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.36-rc3.orig/include/linux/gfp.h
> +++ linux-2.6.36-rc3/include/linux/gfp.h
> @@ -60,6 +60,13 @@ struct vm_area_struct;
> #define __GFP_NOTRACK ((__force gfp_t)0)
> #endif
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#define __GFP_NOMEMCGROUP ((__force gfp_t)0x400000u)
> + /* Don't track by memory cgroup */
> +#else
> +#define __GFP_NOMEMCGROUP ((__force gfp_t)0)
> +#endif
> +
> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> * allocations that simply cannot be supported (e.g. page tables).
> Index: linux-2.6.36-rc3/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.36-rc3.orig/mm/memcontrol.c
> +++ linux-2.6.36-rc3/mm/memcontrol.c
> @@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page
>
> if (mem_cgroup_disabled())
> return 0;
> - if (PageCompound(page))
> + if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
> return 0;
> /*
> * Corner case handling. This is called from add_to_page_cache()
>
The comment above says "not-on-LRU pages are out of control. So, add to root cgroup.".
But this change means that we don't charge these pages at all.
Should it be:
	if (gfp_mask & __GFP_NOMEMCGROUP)
		mm = &init_mm;
?
Or should the comment be changed?
Thanks,
Daisuke Nishimura.
* Re: cgroup: rmdir() does not complete
2010-09-10 4:05 ` Daisuke Nishimura
@ 2010-09-10 4:11 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-10 4:11 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: Mark Hills, Peter Zijlstra, Balbir Singh, linux-kernel
On Fri, 10 Sep 2010 13:05:39 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Fri, 10 Sep 2010 11:16:46 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > Mark Hills <mark@pogo.org.uk> wrote:
> > > The report on the spinning process (23586) is dominated by calls from
> > > mem_cgroup_force_empty.
> > >
> > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > > the load (I assume drain_all_stock_sync has been optimised out). But I
> > > don't think this is as important as what causes the spin.
> > >
> >
> > I noticed you use FUSE, and it seems there is a problem between FUSE and memcg.
> > I wrote a patch (against 2.6.36, but it can be applied).
> >
> Nice catch!
>
> > Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> > right now.
> >
> Sorry, I can't either.
>
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > The memory cgroup catches all pages which are added to the radix-tree and
> > assumes the pages will be added to an LRU somewhere.
> > But there are pages which are on the radix-tree but not on an LRU. Then,
> > force_empty cannot find them and cannot finish ->pre_destroy() and rmdir
> > operations.
> >
> > This patch adds __GFP_NOMEMCGROUP and avoids registering unnecessary,
> > out-of-control pages to the memory cgroup.
> >
> > Note: This gfp flag can also be used for shmem handling, which currently
> > uses complicated heuristics.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > fs/fuse/dev.c | 11 ++++++++++-
> > include/linux/gfp.h | 7 +++++++
> > mm/memcontrol.c | 2 +-
> > 3 files changed, 18 insertions(+), 2 deletions(-)
> >
> > Index: linux-2.6.36-rc3/fs/fuse/dev.c
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/fs/fuse/dev.c
> > +++ linux-2.6.36-rc3/fs/fuse/dev.c
> > @@ -19,6 +19,7 @@
> > #include <linux/pipe_fs_i.h>
> > #include <linux/swap.h>
> > #include <linux/splice.h>
> > +#include <linux/memcontrol.h>
> >
> > MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> > MODULE_ALIAS("devname:fuse");
> > @@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
> > struct pipe_buffer *buf = cs->pipebufs;
> > struct address_space *mapping;
> > pgoff_t index;
> > + gfp_t mask = GFP_KERNEL;
> >
> > unlock_request(cs->fc, cs->req);
> > fuse_copy_finish(cs);
> > @@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
> > remove_from_page_cache(oldpage);
> > page_cache_release(oldpage);
> >
> > - err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
> > + /*
> > + * not-on-LRU pages are out of control. So, add to root cgroup.
> > + * See mm/memcontrol.c for details.
> > + */
> > + if (buf->flags & PIPE_BUF_FLAG_LRU)
> > + mask |= __GFP_NOMEMCGROUP;
> > +
> > + err = add_to_page_cache_locked(newpage, mapping, index, mask);
> > if (err) {
> > printk(KERN_WARNING "fuse_try_move_page: failed to add page");
> > goto out_fallback_unlock;
> > Index: linux-2.6.36-rc3/include/linux/gfp.h
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/include/linux/gfp.h
> > +++ linux-2.6.36-rc3/include/linux/gfp.h
> > @@ -60,6 +60,13 @@ struct vm_area_struct;
> > #define __GFP_NOTRACK ((__force gfp_t)0)
> > #endif
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +#define __GFP_NOMEMCGROUP ((__force gfp_t)0x400000u)
> > + /* Don't track by memory cgroup */
> > +#else
> > +#define __GFP_NOMEMCGROUP ((__force gfp_t)0)
> > +#endif
> > +
> > /*
> > * This may seem redundant, but it's a way of annotating false positives vs.
> > * allocations that simply cannot be supported (e.g. page tables).
> > Index: linux-2.6.36-rc3/mm/memcontrol.c
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/mm/memcontrol.c
> > +++ linux-2.6.36-rc3/mm/memcontrol.c
> > @@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page
> >
> > if (mem_cgroup_disabled())
> > return 0;
> > - if (PageCompound(page))
> > + if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
> > return 0;
> > /*
> > * Corner case handling. This is called from add_to_page_cache()
> >
> The comment above says "not-on-LRU pages are out of control. So, add to root cgroup."
> But this change means that we don't charge these pages at all.
>
> Should it be:
>
> 	if (gfp_mask & __GFP_NOMEMCGROUP)
> 		mm = &init_mm;
>
> ?
> Or should the comment be changed?
>
Yes... the comment is wrong. Here is an updated version with the comment fixed:
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The memory cgroup catches all pages which are added to the radix-tree and
assumes the pages will be added to an LRU somewhere.
But there are pages which are on the radix-tree but not on an LRU. Then,
force_empty cannot find them and cannot finish ->pre_destroy() and rmdir
operations.

This patch adds __GFP_NOMEMCGROUP and avoids registering unnecessary,
out-of-control pages to the memory cgroup.

Note: This gfp flag can also be used for shmem handling, which currently
uses complicated heuristics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/fuse/dev.c | 11 ++++++++++-
include/linux/gfp.h | 7 +++++++
mm/memcontrol.c | 2 +-
3 files changed, 18 insertions(+), 2 deletions(-)
Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/swap.h>
#include <linux/splice.h>
+#include <linux/memcontrol.h>

MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
struct pipe_buffer *buf = cs->pipebufs;
struct address_space *mapping;
pgoff_t index;
+ gfp_t mask = GFP_KERNEL;

unlock_request(cs->fc, cs->req);
fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
remove_from_page_cache(oldpage);
page_cache_release(oldpage);

- err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+ /*
+ * non-LRU pages are out of cgroup control.
+ * See mm/memcontrol.c or Documentation/cgroups/memory.txt for details.
+ */
+ if (buf->flags & PIPE_BUF_FLAG_LRU)
+ mask |= __GFP_NOMEMCGROUP;
+
+ err = add_to_page_cache_locked(newpage, mapping, index, mask);
if (err) {
printk(KERN_WARNING "fuse_try_move_page: failed to add page");
goto out_fallback_unlock;
Index: linux-2.6.36-rc3/include/linux/gfp.h
===================================================================
--- linux-2.6.36-rc3.orig/include/linux/gfp.h
+++ linux-2.6.36-rc3/include/linux/gfp.h
@@ -60,6 +60,13 @@ struct vm_area_struct;
#define __GFP_NOTRACK ((__force gfp_t)0)
#endif

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#define __GFP_NOMEMCGROUP ((__force gfp_t)0x400000u)
+ /* Don't track by memory cgroup */
+#else
+#define __GFP_NOMEMCGROUP ((__force gfp_t)0)
+#endif
+
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
Index: linux-2.6.36-rc3/mm/memcontrol.c
===================================================================
--- linux-2.6.36-rc3.orig/mm/memcontrol.c
+++ linux-2.6.36-rc3/mm/memcontrol.c
@@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page

if (mem_cgroup_disabled())
return 0;
- if (PageCompound(page))
+ if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
return 0;
/*
* Corner case handling. This is called from add_to_page_cache()
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: cgroup: rmdir() does not complete
2010-09-10 2:16 ` KAMEZAWA Hiroyuki
2010-09-10 4:05 ` Daisuke Nishimura
@ 2010-09-10 7:28 ` Mark Hills
2010-09-10 7:33 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-10 7:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel
On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
> On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
> > The report on the spinning process (23586) is dominated by calls from
> > mem_cgroup_force_empty.
> >
> > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > the load (I assume drain_all_stock_sync has been optimised out). But I
> > don't think this is as important as what causes the spin.
> >
>
> I noticed you use FUSE, and it seems there is a problem between FUSE and memcg.
> I wrote a patch (against 2.6.36, but it can be applied).
>
> Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> right now.
What makes you conclude that FUSE is in use? I do not think this is the
case. Or do you mean that it is a problem that the kernel is built with
FUSE support?
I _can_ test the patch, but I still cannot reliably reproduce the problem,
so it will be hard to tell whether the patch works or not. Is there a
way to build a test case for this?
Thanks for your help
--
Mark
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: cgroup: rmdir() does not complete
2010-09-10 7:28 ` Mark Hills
@ 2010-09-10 7:33 ` KAMEZAWA Hiroyuki
2010-09-10 7:51 ` Mark Hills
0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-10 7:33 UTC (permalink / raw)
To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel
On Fri, 10 Sep 2010 08:28:00 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
>
> > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > Mark Hills <mark@pogo.org.uk> wrote:
> > > The report on the spinning process (23586) is dominated by calls from
> > > mem_cgroup_force_empty.
> > >
> > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > > the load (I assume drain_all_stock_sync has been optimised out). But I
> > > don't think this is as important as what causes the spin.
> > >
> >
> > I noticed you use FUSE, and it seems there is a problem between FUSE and memcg.
> > I wrote a patch (against 2.6.36, but it can be applied).
> >
> > Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> > right now.
>
> What makes you conclude that FUSE is in use? I do not think this is the
> case. Or do you mean that it is a problem that the kernel is built with
> FUSE support?
>
You wrote:
> The test case I was running is similar to the above. With the Lustre
> filesystem the problem takes 4 hours or more to show itself. Recently I
> ran 4 threads for over 24 hours without it being seen -- I suspect some
> external factor is involved.
I think the Lustre FS is using FUSE. Am I wrong?
> I _can_ test the patch, but I still cannot reliably reproduce the problem
> so it will be hard to conclude whether the patch works or not. Is there a
> way to build a test case for this?
>
I'm sorry, I'm not sure yet. But from your report, you have 6 pages of charge
which cannot be found by force_empty(). And I found that FUSE's pipe copy code
inserts a page into the page-cache radix-tree but does not move it onto the LRU
(see the sketch after this list).

So,
- There are remaining pages which are out of the LRU.
- FUSE's "move" code does something curious: add_to_page_cache(), but no LRU.
- You reported that you use the Lustre FS.
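To illustrate the difference, an abbreviated and untested sketch, using the
variable names from fuse_try_move_page() above:

	/*
	 * the common path (not taken here): the page is charged to the
	 * memcg and also put on the LRU, where force_empty() can see it
	 */
	err = add_to_page_cache_lru(newpage, mapping, index, GFP_KERNEL);

	/*
	 * what fuse_try_move_page() actually does: the page is charged
	 * via mem_cgroup_cache_charge() but never added to the LRU
	 */
	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);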
Then, I ask you: to test this, I would have to study FUSE and write a test
module... Maybe adding a printk() where I added the gfp_mask modification in
fuse/dev.c could show something.
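An untested sketch of such a printk(), on top of the patch above (the
KERN_INFO text is illustrative only):

	if (buf->flags & PIPE_BUF_FLAG_LRU) {
		mask |= __GFP_NOMEMCGROUP;
		printk(KERN_INFO
		       "fuse_try_move_page: __GFP_NOMEMCGROUP page at index %lu\n",
		       (unsigned long)index);
	}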
We may have some other problem as well, but it seems this is one of them.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: cgroup: rmdir() does not complete
2010-09-10 7:33 ` KAMEZAWA Hiroyuki
@ 2010-09-10 7:51 ` Mark Hills
0 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-09-10 7:51 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel
On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
> On Fri, 10 Sep 2010 08:28:00 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
>
> > On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
> >
> > > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > > Mark Hills <mark@pogo.org.uk> wrote:
> > > > The report on the spinning process (23586) is dominated by calls from
> > > > mem_cgroup_force_empty.
> > > >
> > > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing
> > > > the load (I assume drain_all_stock_sync has been optimised out). But I
> > > > don't think this is as important as what causes the spin.
> > > >
> > >
> > > I noticed you use FUSE, and it seems there is a problem between FUSE and memcg.
> > > I wrote a patch (against 2.6.36, but it can be applied).
> > >
> > > Could you try this? I'm sorry, I don't use a FUSE system and can't test it
> > > right now.
> >
> > What makes you conclude that FUSE is in use? I do not think this is the
> > case. Or do you mean that it is a problem that the kernel is built with
> > FUSE support?
> >
> You wrote:
> > The test case I was running is similar to the above. With the Lustre
> > filesystem the problem takes 4 hours or more to show itself. Recently I
> > ran 4 threads for over 24 hours without it being seen -- I suspect some
> > external factor is involved.
>
> I think the Lustre FS is using FUSE. Am I wrong?
Lustre does not use FUSE. But the client is a set of kernel modules, so
these could do anything.
> > I _can_ test the patch, but I still cannot reliably reproduce the problem,
> > so it will be hard to tell whether the patch works or not. Is there a
> > way to build a test case for this?
> >
>
> I'm sorry, I'm not sure yet. But from your report, you have 6 pages of charge
> which cannot be found by force_empty(). And I found that FUSE's pipe copy code
> inserts a page into the page-cache radix-tree but does not move it onto the LRU.
>
> So,
> - There are remaining pages which are out of the LRU.
> - FUSE's "move" code does something curious: add_to_page_cache(), but no LRU.
> - You reported that you use the Lustre FS.
>
> Then, I ask you: to test this, I would have to study FUSE and write a test
> module... Maybe adding a printk() where I added the gfp_mask modification in
> fuse/dev.c could show something.
>
> We may have some other problem as well, but it seems this is one of them.
Okay, it sounds like perhaps I need to investigate Lustre; I will do this
next week. But I think FUSE can be ruled out.
Thanks again
--
Mark
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2010-09-10 7:51 UTC | newest]
Thread overview: 26+ messages
2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
2010-08-27 0:56 ` Daisuke Nishimura
2010-08-27 1:20 ` Balbir Singh
2010-08-27 2:35 ` KAMEZAWA Hiroyuki
2010-08-27 3:39 ` Daisuke Nishimura
2010-08-27 5:42 ` KAMEZAWA Hiroyuki
2010-08-27 6:29 ` KAMEZAWA Hiroyuki
2010-08-30 7:32 ` Balbir Singh
2010-08-30 9:13 ` Mark Hills
2010-09-01 11:10 ` Mark Hills
2010-09-01 23:42 ` KAMEZAWA Hiroyuki
2010-09-02 9:45 ` Mark Hills
2010-09-09 10:01 ` Mark Hills
2010-09-09 10:09 ` Balbir Singh
2010-09-09 11:36 ` Mark Hills
2010-09-09 11:50 ` Peter Zijlstra
2010-09-09 23:04 ` Mark Hills
2010-09-09 23:43 ` KAMEZAWA Hiroyuki
2010-09-10 2:16 ` KAMEZAWA Hiroyuki
2010-09-10 4:05 ` Daisuke Nishimura
2010-09-10 4:11 ` KAMEZAWA Hiroyuki
2010-09-10 7:28 ` Mark Hills
2010-09-10 7:33 ` KAMEZAWA Hiroyuki
2010-09-10 7:51 ` Mark Hills
2010-08-27 1:25 ` KAMEZAWA Hiroyuki
2010-08-30 9:25 ` Mark Hills