* cgroup: rmdir() does not complete
@ 2010-08-26 15:51 Mark Hills
  2010-08-27  0:56 ` Daisuke Nishimura
  2010-08-27  1:25 ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 26+ messages in thread
From: Mark Hills @ 2010-08-26 15:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel

I am experiencing hung tasks when trying to rmdir() on a cgroup. One task 
spins, others queue up behind it with the following:

  INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
  ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
  0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
  ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
  Call Trace:
  [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
  [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
  [<ffffffff81108a7c>] ? path_put+0x1d/0x22
  [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
  [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
  [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
  [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
  [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no 
tasks.

Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to 
the rmdir. This looks like what I am seeing here, and it indicates that some 
cgroup subsystem is busy, indefinitely.

I have not worked out how to reproduce it quickly. My only way is to 
complete a 'dd' command in the cgroup, but the problem is then so rare that 
progress is slow.

Documentation/cgroup.memory.txt describes how force_empty can be required 
in some cases. Does this mean that with the patch above, these cases will 
now spin on rmdir(), instead of returning -EBUSY? How can I produce a 
reliable test case requiring memory.force_empty to be used, to test this?

Or is it likely to be some other cause, and how best to find it?

Thanks

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
@ 2010-08-27  0:56 ` Daisuke Nishimura
  2010-08-27  1:20   ` Balbir Singh
  2010-08-27  2:35   ` KAMEZAWA Hiroyuki
  2010-08-27  1:25 ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 26+ messages in thread
From: Daisuke Nishimura @ 2010-08-27  0:56 UTC (permalink / raw)
  To: Mark Hills; +Cc: KAMEZAWA Hiroyuki, linux-kernel, Daisuke Nishimura

Hi.

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task 
> spins, others queue up behind it with the following:
> 
>   INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
>   ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
>   0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
>   ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
>   Call Trace:
>   [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
>   [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
>   [<ffffffff81108a7c>] ? path_put+0x1d/0x22
>   [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
>   [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
>   [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
>   [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
>   [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
> 
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no 
> tasks.
> 
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to 
> the rmdir. It looks like what I am seeing here and indicates that some 
> cgroup subsystem is busy, indefinitely.
> 
That commit did introduce an rmdir bug, but it was fixed by commit 88703267.
The fix was merged in 2.6.31, so it seems that you have hit a new one...

> I have not worked out how to reproduce it quickly. My only way is to 
> complete a 'dd' command in the cgroup, but then the problem is so rare it 
> is slow progress.
> 
> Documentation/cgroup.memory.txt describes how force_empty can be required 
> in some cases. Does this mean that with the patch above, these cases will 
> now spin on rmdir(), instead of returning -EBUSY? How can produce a 
> reliable test case requiring memory.force_empty to be used, to test this?
> 
You don't need to touch "force_empty". rmdir() does what "force_empty" does.

> Or is it likely to be some other cause, and how best to find it?
> 
First of all, which cgroup subsystems were mounted on the hierarchy containing
the directory you tried to rmdir()?
If you mounted several subsystems on the same hierarchy, can you mount them
separately to narrow down the cause?
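
For example, something like this (just a sketch; the mount points and the
split of subsystems below are only illustrative):

  # give the memory subsystem its own hierarchy, separate from the others
  mkdir -p /cgroup/memory /cgroup/cpuset
  mount -t cgroup -o memory none /cgroup/memory
  mount -t cgroup -o cpuset none /cgroup/cpuset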


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  0:56 ` Daisuke Nishimura
@ 2010-08-27  1:20   ` Balbir Singh
  2010-08-27  2:35   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2010-08-27  1:20 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Mark Hills, KAMEZAWA Hiroyuki, linux-kernel

On Fri, Aug 27, 2010 at 6:26 AM, Daisuke Nishimura
<nishimura@mxp.nes.nec.co.jp> wrote:
> Hi.
>
> On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
>
>> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task
>> spins, others queue up behind it with the following:
>>
>>   INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>   soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
>>   ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
>>   0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
>>   ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
>>   Call Trace:
>>   [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
>>   [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
>>   [<ffffffff81108a7c>] ? path_put+0x1d/0x22
>>   [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
>>   [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
>>   [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
>>   [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
>>   [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
>>
>> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no
>> tasks.
>>
>> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to
>> the rmdir. It looks like what I am seeing here and indicates that some
>> cgroup subsystem is busy, indefinitely.
>>
> The commit had caused a bug about rmdir, but it was fixed by the commit 88703267.
> The fix was merged in 2.6.31, so it seems that you hit a new one...
>
>> I have not worked out how to reproduce it quickly. My only way is to
>> complete a 'dd' command in the cgroup, but then the problem is so rare it
>> is slow progress.
>>
>> Documentation/cgroup.memory.txt describes how force_empty can be required
>> in some cases. Does this mean that with the patch above, these cases will
>> now spin on rmdir(), instead of returning -EBUSY? How can produce a
>> reliable test case requiring memory.force_empty to be used, to test this?
>>
> You don't need to touch "force_empty". rmdir() does what "force_empty" does.
>
>> Or is it likely to be some other cause, and how best to find it?
>>
> What cgroup subsystem did you mount where the directory existed you tried
> to rmdir() first ?
> If you mounted several subsystems on the same hierarchy, can you mount them
> separately to narrow down the cause ?
>

It would also be nice to see what your mounted cgroup (filesystem
perspective) looks like and what /proc/cgroups looks like when the
problem occurs.
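
For example (the usual places to look; adjust the mount point for your setup):

  grep cgroup /proc/mounts
  cat /proc/cgroups
  ls -l /cgroup        # or wherever the hierarchy is mounted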

Balbir

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
  2010-08-27  0:56 ` Daisuke Nishimura
@ 2010-08-27  1:25 ` KAMEZAWA Hiroyuki
  2010-08-30  9:25   ` Mark Hills
  1 sibling, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27  1:25 UTC (permalink / raw)
  To: Mark Hills; +Cc: linux-kernel

On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> I am experiencing hung tasks when trying to rmdir() on a cgroup. One task 
> spins, others queue up behind it with the following:
> 
>   INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
>   ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
>   0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
>   ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
>   Call Trace:
>   [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
>   [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
>   [<ffffffff81108a7c>] ? path_put+0x1d/0x22
>   [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
>   [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
>   [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
>   [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
>   [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
> 
> Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no 
> tasks.
> 
> Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to 
> the rmdir. It looks like what I am seeing here and indicates that some 
> cgroup subsystem is busy, indefinitely.
> 

Hmm. Does it really spin? Or does it sleep forever with no wake-up?

> I have not worked out how to reproduce it quickly. My only way is to 
> complete a 'dd' command in the cgroup, but then the problem is so rare it 
> is slow progress.
> 
Please show how you reproduce it on your side.
And which cgroup subsystems are mounted? The memory cgroup only?

> Documentation/cgroup.memory.txt describes how force_empty can be required 
> in some cases. 

Ah, maybe that text is wrong. rmdir() calls force_empty automatically.

> Does this mean that with the patch above, these cases will 
> now spin on rmdir(), instead of returning -EBUSY? How can produce a 
> reliable test case requiring memory.force_empty to be used, to test this?
> 

Hmm. I'm not sure whether the Fedora kernel has features of its own beyond the stock kernel.
I would be glad if you could check whether this can happen on a stock kernel, 2.6.35.

> Or is it likely to be some other cause, and how best to find it?
> 

At first look, the mutex above is the one taken in do_rmdir(), not one in kernel/cgroup.c.
So the rmdir doesn't seem to have reached the cgroup code...
Are you doing another operation on the directory while rmdir is called?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  0:56 ` Daisuke Nishimura
  2010-08-27  1:20   ` Balbir Singh
@ 2010-08-27  2:35   ` KAMEZAWA Hiroyuki
  2010-08-27  3:39     ` Daisuke Nishimura
  1 sibling, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27  2:35 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Mark Hills, linux-kernel

On Fri, 27 Aug 2010 09:56:39 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > Or is it likely to be some other cause, and how best to find it?
> > 
> What cgroup subsystem did you mount where the directory existed you tried
> to rmdir() first ?
> If you mounted several subsystems on the same hierarchy, can you mount them
> separately to narrow down the cause ?
> 

It seems I can reproduce the issue on mmotm-0811, too.

try this.

Here, memory cgroup is mounted at /cgroups.
==
#!/bin/bash -x

while sleep 1; do
        date
        mkdir /cgroups/test
        echo 0 > /cgroups/test/tasks
        echo 300M > /cgroups/test/memory.limit_in_bytes
        cat /proc/self/cgroup
        dd if=/dev/zero of=./tmpfile bs=4096 count=100000
        echo 0 > /cgroups/tasks
        cat /proc/self/cgroup
        rmdir /cgroups/test
        rm ./tmpfile
done
==

It hangs at rmdir. I'm now investigating force_empty.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  2:35   ` KAMEZAWA Hiroyuki
@ 2010-08-27  3:39     ` Daisuke Nishimura
  2010-08-27  5:42       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Daisuke Nishimura @ 2010-08-27  3:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Mark Hills, linux-kernel, balbir, Daisuke Nishimura

On Fri, 27 Aug 2010 11:35:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 27 Aug 2010 09:56:39 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > Or is it likely to be some other cause, and how best to find it?
> > > 
> > What cgroup subsystem did you mount where the directory existed you tried
> > to rmdir() first ?
> > If you mounted several subsystems on the same hierarchy, can you mount them
> > separately to narrow down the cause ?
> > 
> 
> It seems I can reproduce the issue on mmotm-0811, too.
> 
> try this.
> 
> Here, memory cgroup is mounted at /cgroups.
> ==
> #!/bin/bash -x
> 
> while sleep 1; do
>         date
>         mkdir /cgroups/test
>         echo 0 > /cgroups/test/tasks
>         echo 300M > /cgroups/test/memory.limit_in_bytes
>         cat /proc/self/cgroup
>         dd if=/dev/zero of=./tmpfile bs=4096 count=100000
>         echo 0 > /cgroups/tasks
>         cat /proc/self/cgroup
>         rmdir /cgroups/test
>         rm ./tmpfile
> done
> ==
> 
> hangs at rmdir. I'm no investigating force_empty.
> 
Thank you very much for your information.

Some questions.

Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
And, how long does it likely to take to cause this problem ?
I've run it on RHEL6-based kernel/ext3 for about one hour, but
I cannot reproduce it yet.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  3:39     ` Daisuke Nishimura
@ 2010-08-27  5:42       ` KAMEZAWA Hiroyuki
  2010-08-27  6:29         ` KAMEZAWA Hiroyuki
                           ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27  5:42 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Mark Hills, linux-kernel, balbir

On Fri, 27 Aug 2010 12:39:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 27 Aug 2010 11:35:06 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Fri, 27 Aug 2010 09:56:39 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > > Or is it likely to be some other cause, and how best to find it?
> > > > 
> > > What cgroup subsystem did you mount where the directory existed you tried
> > > to rmdir() first ?
> > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > separately to narrow down the cause ?
> > > 
> > 
> > It seems I can reproduce the issue on mmotm-0811, too.
> > 
> > try this.
> > 
> > Here, memory cgroup is mounted at /cgroups.
> > ==
> > #!/bin/bash -x
> > 
> > while sleep 1; do
> >         date
> >         mkdir /cgroups/test
> >         echo 0 > /cgroups/test/tasks
> >         echo 300M > /cgroups/test/memory.limit_in_bytes
> >         cat /proc/self/cgroup
> >         dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> >         echo 0 > /cgroups/tasks
> >         cat /proc/self/cgroup
> >         rmdir /cgroups/test
> >         rm ./tmpfile
> > done
> > ==
> > 
> > hangs at rmdir. I'm no investigating force_empty.
> > 
> Thank you very much for your information.
> 
> Some questions.
> 
> Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
on ext4.

> And, how long does it likely to take to cause this problem ?

Very soon, within 10-20 loops.

> I've run it on RHEL6-based kernel/ext3 for about one hour, but
> I cannot reproduce it yet.
> 

Hmm... I'll dig more. Maybe I need to use a stock kernel rather than -mm...


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  5:42       ` KAMEZAWA Hiroyuki
@ 2010-08-27  6:29         ` KAMEZAWA Hiroyuki
  2010-08-30  7:32           ` Balbir Singh
  2010-08-30  9:13         ` Mark Hills
  2010-09-01 11:10         ` Mark Hills
  2 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-27  6:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, Mark Hills, linux-kernel, balbir

On Fri, 27 Aug 2010 14:42:25 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
 
> > I've run it on RHEL6-based kernel/ext3 for about one hour, but
> > I cannot reproduce it yet.
> > 
> 
> Hmm...I'll dig more. Maybe I need to use stock kernel rather than -mm...
> 
> 
Sorry, my test only hangs on -mm + (other patches);
there are no troubles on 2.6.34 or 2.6.36-rc1.

Where can I find the 2.6.33.6 (Fedora) kernel?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  6:29         ` KAMEZAWA Hiroyuki
@ 2010-08-30  7:32           ` Balbir Singh
  0 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2010-08-30  7:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, Mark Hills, linux-kernel

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-08-27 15:29:58]:

> On Fri, 27 Aug 2010 14:42:25 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > I've run it on RHEL6-based kernel/ext3 for about one hour, but
> > > I cannot reproduce it yet.
> > > 
> > 
> > Hmm...I'll dig more. Maybe I need to use stock kernel rather than -mm...
> > 
> > 
> Sorry, my test just hangs on -mm + (other patches)
> no troubles on 2.6.34 and 2.6.36-rc1.
> 
> Where can I see  2.6.33.6(Fedora) kernel ?
>

You can get the SRPM from the mirrors, one place to find it would be

http://download.fedora.redhat.com/pub/fedora/linux/updates/13/SRPMS/ 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  5:42       ` KAMEZAWA Hiroyuki
  2010-08-27  6:29         ` KAMEZAWA Hiroyuki
@ 2010-08-30  9:13         ` Mark Hills
  2010-09-01 11:10         ` Mark Hills
  2 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-08-30  9:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 27 Aug 2010 12:39:48 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 27 Aug 2010 11:35:06 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Fri, 27 Aug 2010 09:56:39 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > 
> > > > > Or is it likely to be some other cause, and how best to find it?
> > > > > 
> > > > What cgroup subsystem did you mount where the directory existed you tried
> > > > to rmdir() first ?
> > > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > > separately to narrow down the cause ?
> > > > 
> > > 
> > > It seems I can reproduce the issue on mmotm-0811, too.
> > > 
> > > try this.
> > > 
> > > Here, memory cgroup is mounted at /cgroups.
> > > ==
> > > #!/bin/bash -x
> > > 
> > > while sleep 1; do
> > >         date
> > >         mkdir /cgroups/test
> > >         echo 0 > /cgroups/test/tasks
> > >         echo 300M > /cgroups/test/memory.limit_in_bytes
> > >         cat /proc/self/cgroup
> > >         dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > >         echo 0 > /cgroups/tasks
> > >         cat /proc/self/cgroup
> > >         rmdir /cgroups/test
> > >         rm ./tmpfile
> > > done
> > > ==
> > > 
> > > hangs at rmdir. I'm no investigating force_empty.
> > > 
> > Thank you very much for your information.
> > 
> > Some questions.
> > 
> > Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
> on ext4.
> 
> > And, how long does it likely to take to cause this problem ?
> 
> very soon. 10-20 loop.

The test case I was running is similar to the above. With the Lustre 
filesystem the problem takes 4 hours or more to show itself. Recently I 
ran 4 threads for over 24 hours without it being seen -- I suspect some 
external factor is involved.

I also tried NFS, and did not see a problem after 8 hours or so, but this 
is inconclusive.

The combination of the Fedora kernel and the Lustre filesystem is not a 
satisfactory setup for tracing the bug. Until I can get a test case which is 
more readily reproducible, I can't reasonably start changing variables.

It is interesting that you see the problem so readily on ext4; I will test 
that soon (it is currently a holiday weekend in the UK). I hope it will give 
me the test case I am looking for.

Thanks

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  1:25 ` KAMEZAWA Hiroyuki
@ 2010-08-30  9:25   ` Mark Hills
  0 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-08-30  9:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:

> On Thu, 26 Aug 2010 16:51:55 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
> 
> > I am experiencing hung tasks when trying to rmdir() on a cgroup. One task 
> > spins, others queue up behind it with the following:
> > 
> >   INFO: task soaked-cgroup:27257 blocked for more than 120 seconds.
> >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >   soaked-cgrou D ffff8800058157c0     0 27257  29411 0x00000000
> >   ffff88004ffffdd8 0000000000000086 ffff88004ffffda8 ffff88004ffffeb8
> >   0000000000000010 ffff880119813780 ffff88004ffffd48 ffff88004fffffd8
> >   ffff88004fffffd8 000000000000f9b0 00000000000157c0 ffff880137693268
> >   Call Trace:
> >   [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
> >   [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
> >   [<ffffffff81108a7c>] ? path_put+0x1d/0x22
> >   [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
> >   [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
> >   [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
> >   [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
> >   [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
> > 
> > Kernel is from Fedora, 2.6.33.6. In all cases the cgroup contains no 
> > tasks.
> > 
> > Commit ec64f5 ("fix frequent -EBUSY at rmdir") adds a busy wait loop to 
> > the rmdir. It looks like what I am seeing here and indicates that some 
> > cgroup subsystem is busy, indefinitely.
> > 
> 
> Hmm. really spin ? sleeping-forever-no-wake-up ?

It sleeps in D state, but periodically enters an interruptible state, which 
is why my attention was drawn to that loop.

> > I have not worked out how to reproduce it quickly. My only way is to 
> > complete a 'dd' command in the cgroup, but then the problem is so rare it 
> > is slow progress.
> > 
> please show how-to-reproduce in your way.

I use a C program which creates a container and places itself in the 
container, then forks a dd process.
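
Roughly, it does the equivalent of the following (a sketch only; the real 
program is in C, and the group name and dd size here are illustrative):

  mkdir /cgroup/soaked-$$
  echo $$ > /cgroup/soaked-$$/tasks
  dd if=/dev/zero of=tmpfile bs=4096 count=100000
  echo $$ > /cgroup/tasks
  rmdir /cgroup/soaked-$$
  rm tmpfile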

But it seems you found an easier test case; I hope to test that soon.

> And what cgroup is mounted ? memory cgroup only ?

Quite a few: memory, blkio, cpuacct, cpuset.

Until I can get a more reproducible test case (see my previous mail), I 
can't really reduce this.

> > Documentation/cgroup.memory.txt describes how force_empty can be required 
> > in some cases. 
> 
> Ah, maybe that's wrong text. rmdir() calls force-empty automatically.
> 
> > Does this mean that with the patch above, these cases will 
> > now spin on rmdir(), instead of returning -EBUSY? How can produce a 
> > reliable test case requiring memory.force_empty to be used, to test this?
> > 
> 
> Hmm. I'm not sure fedora-kernel has other (its own) featrues than stock kernel.
> I'm grad if you can check it can happen in stock kernel, 2.6.35.
> 
> > Or is it likely to be some other cause, and how best to find it?
> > 
> 
> At the first look, above mutex is the mutex in do_rmdir(), not kernel/cgroup.c
> Then, rmdir doesn't seem to reach cgroup code...

Interesting; I checked for that, but I'm not sure how I missed it. There is 
clearly a mutex lock taken in do_rmdir() in fs/namei.c.

> Do you do another operation on the directory while rmdir is called ?

In one case I did an 'ls -l' on the filesystem which coincided with a 
lock-up, but I was not able to reproduce this.

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-08-27  5:42       ` KAMEZAWA Hiroyuki
  2010-08-27  6:29         ` KAMEZAWA Hiroyuki
  2010-08-30  9:13         ` Mark Hills
@ 2010-09-01 11:10         ` Mark Hills
  2010-09-01 23:42           ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-01 11:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 27 Aug 2010 12:39:48 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 27 Aug 2010 11:35:06 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Fri, 27 Aug 2010 09:56:39 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > 
> > > > > Or is it likely to be some other cause, and how best to find it?
> > > > > 
> > > > What cgroup subsystem did you mount where the directory existed you tried
> > > > to rmdir() first ?
> > > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > > separately to narrow down the cause ?
> > > > 
> > > 
> > > It seems I can reproduce the issue on mmotm-0811, too.
> > > 
> > > try this.
> > > 
> > > Here, memory cgroup is mounted at /cgroups.
> > > ==
> > > #!/bin/bash -x
> > > 
> > > while sleep 1; do
> > >         date
> > >         mkdir /cgroups/test
> > >         echo 0 > /cgroups/test/tasks
> > >         echo 300M > /cgroups/test/memory.limit_in_bytes
> > >         cat /proc/self/cgroup
> > >         dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > >         echo 0 > /cgroups/tasks
> > >         cat /proc/self/cgroup
> > >         rmdir /cgroups/test
> > >         rm ./tmpfile
> > > done
> > > ==
> > > 
> > > hangs at rmdir. I'm no investigating force_empty.
> > > 
> > Thank you very much for your information.
> > 
> > Some questions.
> > 
> > Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
> on ext4.
> 
> > And, how long does it likely to take to cause this problem ?
> 
> very soon. 10-20 loop.

I repeated the test above, but did not see a problem after many hundreds 
of loops.

My test was with the same kernel from my original bug report (Fedora 
2.6.33.6-147), using memory cgroup only and ext4 filesystem.

So it is possible we are experiencing different bugs with similar 
symptoms.

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-01 11:10         ` Mark Hills
@ 2010-09-01 23:42           ` KAMEZAWA Hiroyuki
  2010-09-02  9:45             ` Mark Hills
  2010-09-09 10:01             ` Mark Hills
  0 siblings, 2 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-01 23:42 UTC (permalink / raw)
  To: Mark Hills; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Wed, 1 Sep 2010 12:10:23 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> On Fri, 27 Aug 2010, KAMEZAWA Hiroyuki wrote:
> 
> > On Fri, 27 Aug 2010 12:39:48 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > On Fri, 27 Aug 2010 11:35:06 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Fri, 27 Aug 2010 09:56:39 +0900
> > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > 
> > > > > > Or is it likely to be some other cause, and how best to find it?
> > > > > > 
> > > > > What cgroup subsystem did you mount where the directory existed you tried
> > > > > to rmdir() first ?
> > > > > If you mounted several subsystems on the same hierarchy, can you mount them
> > > > > separately to narrow down the cause ?
> > > > > 
> > > > 
> > > > It seems I can reproduce the issue on mmotm-0811, too.
> > > > 
> > > > try this.
> > > > 
> > > > Here, memory cgroup is mounted at /cgroups.
> > > > ==
> > > > #!/bin/bash -x
> > > > 
> > > > while sleep 1; do
> > > >         date
> > > >         mkdir /cgroups/test
> > > >         echo 0 > /cgroups/test/tasks
> > > >         echo 300M > /cgroups/test/memory.limit_in_bytes
> > > >         cat /proc/self/cgroup
> > > >         dd if=/dev/zero of=./tmpfile bs=4096 count=100000
> > > >         echo 0 > /cgroups/tasks
> > > >         cat /proc/self/cgroup
> > > >         rmdir /cgroups/test
> > > >         rm ./tmpfile
> > > > done
> > > > ==
> > > > 
> > > > hangs at rmdir. I'm no investigating force_empty.
> > > > 
> > > Thank you very much for your information.
> > > 
> > > Some questions.
> > > 
> > > Is "tmpfile" created on a normal filesystem(e.g. ext3) or tmpfs ?
> > on ext4.
> > 
> > > And, how long does it likely to take to cause this problem ?
> > 
> > very soon. 10-20 loop.
> 
> I repeated the test above, but did not see a problem after many hundreds 
> of loops.
> 
> My test was with the same kernel from my original bug report (Fedora 
> 2.6.33.6-147), using memory cgroup only and ext4 filesystem.
> 
> So it is possible we are experiencing different bugs with similar 
> symptoms.
> 

Thank you for confirming.
But hmm... it's curious who holds the mutex and what is happening.

-Kame


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-01 23:42           ` KAMEZAWA Hiroyuki
@ 2010-09-02  9:45             ` Mark Hills
  2010-09-09 10:01             ` Mark Hills
  1 sibling, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-09-02  9:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:

> On Wed, 1 Sep 2010 12:10:23 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
[...]
> > I repeated the test above, but did not see a problem after many hundreds 
> > of loops.
> > 
> > My test was with the same kernel from my original bug report (Fedora 
> > 2.6.33.6-147), using memory cgroup only and ext4 filesystem.
> > 
> > So it is possible we are experiencing different bugs with similar 
> > symptoms.
> > 
> 
> Thank you for confirming.
> But hmm...it's curious who holds mutex and what happens.

Refer to my original email, where I was running multiple tests at once. 
This backtrace is from the tests which queue up:

  Call Trace:
  [<ffffffff81115edb>] ? mntput_no_expire+0x24/0xe7
  [<ffffffff81427acd>] __mutex_lock_common+0x14d/0x1b4
  [<ffffffff81108a7c>] ? path_put+0x1d/0x22
  [<ffffffff81427b48>] __mutex_lock_slowpath+0x14/0x16
  [<ffffffff81427c4f>] mutex_lock+0x31/0x4b
  [<ffffffff8110bdf8>] do_rmdir+0x74/0x102
  [<ffffffff8110bebd>] sys_rmdir+0x11/0x13
  [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

The one which spins has already managed to claim the mutex lock on the 
/cgroup directory, and no call trace is shown for this. Is there a usable 
way to force a similar call trace for the spinning process?

Unfortunately I have not been able to reproduce the problem for some days 
now, so I think some network factor may be influencing this.

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-01 23:42           ` KAMEZAWA Hiroyuki
  2010-09-02  9:45             ` Mark Hills
@ 2010-09-09 10:01             ` Mark Hills
  2010-09-09 10:09               ` Balbir Singh
  1 sibling, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-09 10:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-kernel, balbir

On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:

[...]
> But hmm...it's curious who holds mutex and what happens.

I have a system showing the failure case (but I still do not have a way to 
reliably repeat it).

Here are the two processes:

23586 pts/0    RL+  5059:18 /net/homes/mhills/tmp/soaked-cgroup
23685 pts/6    DL+    0:00 /net/homes/mhills/tmp/soaked-cgroup

23586 spends almost all of its time in 'RL+' status; occasionally it is 
seen in 'DL+' status.

From my analysis before, both are blocked on rmdir(), but one is spinning, 
holding the lock on /cgroup, and the other is waiting for the lock. If I 
strace 23586 then the rmdir() fails with EINTR.

How best to capture information which might show why the process spins?

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-09 10:01             ` Mark Hills
@ 2010-09-09 10:09               ` Balbir Singh
  2010-09-09 11:36                 ` Mark Hills
  0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2010-09-09 10:09 UTC (permalink / raw)
  To: Mark Hills; +Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

* Mark Hills <mark@pogo.org.uk> [2010-09-09 11:01:45]:

> On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:
> 
> [...]
> > But hmm...it's curious who holds mutex and what happens.
> 
> I have a system showing the failure case (but still do not have a way to 
> reliably repeat it)
> 
> Here are the two processes:
> 
> 23586 pts/0    RL+  5059:18 /net/homes/mhills/tmp/soaked-cgroup
> 23685 pts/6    DL+    0:00 /net/homes/mhills/tmp/soaked-cgroup
> 
> 23586 spends almost all of its time in 'RL+' status, occasionally it is 
> seen in 'DL+' status.
> 
> From my analysis before, both are blocked on rmdir(), but one is spinning, 
> holding the lock on the /cgroup, and the other is waiting for the lock. If 
> I strace 23586 then the rmdir() fails with EINTR.
>

Any chance you can compile a kernel with the cgroup debug subsystem and get
information from there?
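
e.g. something like this (a sketch; it needs a kernel built with
CONFIG_CGROUP_DEBUG=y, and the debug controller has to be part of the
hierarchy you are testing):

  mount -t cgroup -o memory,blkio,cpuacct,cpuset,debug none /cgroup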

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-09 10:09               ` Balbir Singh
@ 2010-09-09 11:36                 ` Mark Hills
  2010-09-09 11:50                   ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-09 11:36 UTC (permalink / raw)
  To: Balbir Singh; +Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

On Thu, 9 Sep 2010, Balbir Singh wrote:

> * Mark Hills <mark@pogo.org.uk> [2010-09-09 11:01:45]:
> 
> > On Thu, 2 Sep 2010, KAMEZAWA Hiroyuki wrote:
> > 
> > [...]
> > > But hmm...it's curious who holds mutex and what happens.
> > 
> > I have a system showing the failure case (but still do not have a way to 
> > reliably repeat it)
> > 
> > Here are the two processes:
> > 
> > 23586 pts/0    RL+  5059:18 /net/homes/mhills/tmp/soaked-cgroup
> > 23685 pts/6    DL+    0:00 /net/homes/mhills/tmp/soaked-cgroup
> > 
> > 23586 spends almost all of its time in 'RL+' status, occasionally it is 
> > seen in 'DL+' status.
> > 
> > From my analysis before, both are blocked on rmdir(), but one is spinning, 
> > holding the lock on the /cgroup, and the other is waiting for the lock. If 
> > I strace 23586 then the rmdir() fails with EINTR.
> >
> 
> Any chance you can compile with debug cgroup subsystem and get
> information from there? 

I can; I'd like to experiment with a custom kernel next.

I am still finding the problem incredibly hard to reproduce, so I'd like 
to observe as much data as possible from the current case before 
rebooting. If I could capture some kind of stack trace in the kernel for 
the running process, that would be great; any suggestions appreciated.

Thanks

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-09 11:36                 ` Mark Hills
@ 2010-09-09 11:50                   ` Peter Zijlstra
  2010-09-09 23:04                     ` Mark Hills
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2010-09-09 11:50 UTC (permalink / raw)
  To: Mark Hills
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

On Thu, 2010-09-09 at 12:36 +0100, Mark Hills wrote:

> I am still finding the problem incredibly hard to reproduce, so I'd like 
> to observe as much data as possible from the current case before 
> rebooting. If I could capture some kind of stack trace in the kernel for 
> the running process that would be great, any suggestions appreciated.

echo l > /proc/sysrq-trigger

another thing you can do is run something like: perf record -gp $pid
which will give you a profile of that task.
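
e.g. (let it run for a while, then interrupt it):

  perf record -g -p $pid    # ^C after 30 seconds or so
  perf report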

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-09 11:50                   ` Peter Zijlstra
@ 2010-09-09 23:04                     ` Mark Hills
  2010-09-09 23:43                       ` KAMEZAWA Hiroyuki
  2010-09-10  2:16                       ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 26+ messages in thread
From: Mark Hills @ 2010-09-09 23:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

On Thu, 9 Sep 2010, Peter Zijlstra wrote:

> On Thu, 2010-09-09 at 12:36 +0100, Mark Hills wrote:
> 
> > I am still finding the problem incredibly hard to reproduce, so I'd like 
> > to observe as much data as possible from the current case before 
> > rebooting. If I could capture some kind of stack trace in the kernel for 
> > the running process that would be great, any suggestions appreciated.
> 
> echo l > /proc/sysrq-trigger

Despite running this many times, I never 'catch' the process on a CPU, even 
though it is using 70% CPU in top. But...

> another thing you can do is run something like: perf record -gp $pid
> which will give you a profile of that task.

This is very useful, thanks.

The report on the spinning process (23586) is dominated by calls from 
mem_cgroup_force_empty.

It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
the load (I assume drain_all_stock_sync has been optimised out). But I 
don't think this is as important as what causes the spin.

There are no tasks in the cgroup, but memory usage is non-zero and 
constant. It seems mem_cgroup_force_empty is unable to empty the cgroup in 
this case.

  # cat /cgroup/soaked-23586/tasks
  # cat /cgroup/soaked-23586/memory.usage_in_bytes
  24576
  # cat /cgroup/soaked-23586/memsw.usage_in_bytes
  <hangs>

Here are the first few entries from the perf output; I can provide the 
rest if needed, but they all result from mem_cgroup_force_empty.

     8.13%   :23586  [kernel]           [k] _raw_spin_lock_irqsave
             |
             --- _raw_spin_lock_irqsave
                |          
                |--45.14%-- probe_workqueue_insertion
                |          insert_work
                |          |          
                |          |--99.09%-- __queue_work
                |          |          queue_work_on
                |          |          schedule_work_on
                |          |          schedule_on_each_cpu
                |          |          |          
                |          |          |--50.59%-- lru_add_drain_all
                |          |          |          mem_cgroup_force_empty
                |          |          |          mem_cgroup_pre_destroy
                |          |          |          cgroup_rmdir
                |          |          |          vfs_rmdir
                |          |          |          do_rmdir
                |          |          |          sys_rmdir
                |          |          |          system_call_fastpath
                |          |          |          0x3f504d27d7
                |          |          |          0x405687
                |          |          |          0x406ef0
                |          |          |          0x402f31
                |          |          |          0x3f5041eb1d
                |          |          |          
                |          |           --49.41%-- mem_cgroup_force_empty
                |          |                     mem_cgroup_pre_destroy
                |          |                     cgroup_rmdir
                |          |                     vfs_rmdir
                |          |                     do_rmdir
                |          |                     sys_rmdir
                |          |                     system_call_fastpath
                |          |                     0x3f504d27d7
                |          |                     0x405687
                |          |                     0x406ef0
                |          |                     0x402f31
                |          |                     0x3f5041eb1d
                |           --0.91%-- [...]
                |          
                |--22.92%-- mem_cgroup_force_empty
                |          mem_cgroup_pre_destroy
                |          cgroup_rmdir
                |          vfs_rmdir
                |          do_rmdir
                |          sys_rmdir
                |          system_call_fastpath
                |          0x3f504d27d7
                |          0x405687
                |          0x406ef0
                |          0x402f31
                |          0x3f5041eb1d
                |          
                |--8.17%-- __queue_work
                |          queue_work_on
                |          schedule_work_on
                |          schedule_on_each_cpu
                |          |          
                |          |--52.09%-- lru_add_drain_all
                |          |          mem_cgroup_force_empty
                |          |          mem_cgroup_pre_destroy
                |          |          cgroup_rmdir
                |          |          vfs_rmdir
                |          |          do_rmdir
                |          |          sys_rmdir
                |          |          system_call_fastpath
                |          |          0x3f504d27d7
                |          |          0x405687
                |          |          0x406ef0
                |          |          0x402f31
                |          |          0x3f5041eb1d
                |          |          
                |           --47.91%-- mem_cgroup_force_empty
                |                     mem_cgroup_pre_destroy
                |                     cgroup_rmdir
                |                     vfs_rmdir
                |                     do_rmdir
                |                     sys_rmdir
                |                     system_call_fastpath
                |                     0x3f504d27d7
                |                     0x405687
                |                     0x406ef0
                |                     0x402f31
                |                     0x3f5041eb1d
                |          
                |--7.94%-- __wake_up
                |          |          
                |          |--99.71%-- insert_work
                |          |          |          
                |          |          |--97.70%-- __queue_work
                |          |          |          queue_work_on
                |          |          |          schedule_work_on
                |          |          |          schedule_on_each_cpu
                |          |          |          |          
                |          |          |          |--50.59%-- mem_cgroup_force_empty
                |          |          |          |          mem_cgroup_pre_destroy
                |          |          |          |          cgroup_rmdir
                |          |          |          |          vfs_rmdir
                |          |          |          |          do_rmdir
                |          |          |          |          sys_rmdir
                |          |          |          |          system_call_fastpath
                |          |          |          |          0x3f504d27d7
                |          |          |          |          0x405687
                |          |          |          |          0x406ef0
                |          |          |          |          0x402f31
                |          |          |          |          0x3f5041eb1d
                |          |          |          |          
                |          |          |           --49.41%-- lru_add_drain_all
                |          |          |                     mem_cgroup_force_empty
                |          |          |                     mem_cgroup_pre_destroy
                |          |          |                     cgroup_rmdir
                |          |          |                     vfs_rmdir
                |          |          |                     do_rmdir
                |          |          |                     sys_rmdir
                |          |          |                     system_call_fastpath
                |          |          |                     0x3f504d27d7
                |          |          |                     0x405687
                |          |          |                     0x406ef0
                |          |          |                     0x402f31
                |          |          |                     0x3f5041eb1d
                |          |           --2.30%-- [...]
                |           --0.29%-- [...]
                |          
                |--4.35%-- mem_cgroup_pre_destroy
                |          cgroup_rmdir
                |          vfs_rmdir
                |          do_rmdir
                |          sys_rmdir
                |          system_call_fastpath
                |          0x3f504d27d7
                |          0x405687
                |          0x406ef0
                |          0x402f31
                |          0x3f5041eb1d
                 --11.47%-- [...]

     7.25%   :23586  [kernel]           [k] sched_clock_cpu
             |
             --- sched_clock_cpu
                |          
                |--97.11%-- update_rq_clock
                |          |          
                |          |--98.89%-- try_to_wake_up
                |          |          default_wake_function
                |          |          autoremove_wake_function
                |          |          __wake_up_common
                |          |          __wake_up
                |          |          insert_work
                |          |          __queue_work
                |          |          queue_work_on
                |          |          schedule_work_on
                |          |          schedule_on_each_cpu
                |          |          |          
                |          |          |--50.69%-- lru_add_drain_all
                |          |          |          mem_cgroup_force_empty
                |          |          |          mem_cgroup_pre_destroy
                |          |          |          cgroup_rmdir
                |          |          |          vfs_rmdir
                |          |          |          do_rmdir
                |          |          |          sys_rmdir
                |          |          |          system_call_fastpath
                |          |          |          0x3f504d27d7
                |          |          |          0x405687
                |          |          |          0x406ef0
                |          |          |          0x402f31
                |          |          |          0x3f5041eb1d
                |          |          |          
                |          |           --49.31%-- mem_cgroup_force_empty
                |          |                     mem_cgroup_pre_destroy
                |          |                     cgroup_rmdir
                |          |                     vfs_rmdir
                |          |                     do_rmdir
                |          |                     sys_rmdir
                |          |                     system_call_fastpath
                |          |                     0x3f504d27d7
                |          |                     0x405687
                |          |                     0x406ef0
                |          |                     0x402f31
                |          |                     0x3f5041eb1d
                |           --1.11%-- [...]
                 --2.89%-- [...]

     5.54%   :23586  [kernel]           [k] try_to_wake_up
             |
             --- try_to_wake_up
                |          
                |--99.13%-- default_wake_function
                |          autoremove_wake_function
                |          __wake_up_common
                |          __wake_up
                |          insert_work
                |          __queue_work
                |          queue_work_on
                |          schedule_work_on
                |          schedule_on_each_cpu
                |          |          
                |          |--52.03%-- lru_add_drain_all
                |          |          mem_cgroup_force_empty
                |          |          mem_cgroup_pre_destroy
                |          |          cgroup_rmdir
                |          |          vfs_rmdir
                |          |          do_rmdir
                |          |          sys_rmdir
                |          |          system_call_fastpath
                |          |          0x3f504d27d7
                |          |          0x405687
                |          |          0x406ef0
                |          |          0x402f31
                |          |          0x3f5041eb1d
                |          |          
                |           --47.97%-- mem_cgroup_force_empty
                |                     mem_cgroup_pre_destroy
                |                     cgroup_rmdir
                |                     vfs_rmdir
                |                     do_rmdir
                |                     sys_rmdir
                |                     system_call_fastpath
                |                     0x3f504d27d7
                |                     0x405687
                |                     0x406ef0
                |                     0x402f31
                |                     0x3f5041eb1d
                 --0.87%-- [...]

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-09 23:04                     ` Mark Hills
@ 2010-09-09 23:43                       ` KAMEZAWA Hiroyuki
  2010-09-10  2:16                       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 23:43 UTC (permalink / raw)
  To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> On Thu, 9 Sep 2010, Peter Zijlstra wrote:
> 
> > On Thu, 2010-09-09 at 12:36 +0100, Mark Hills wrote:
> > 
> > > I am still finding the problem incredibly hard to reproduce, so I'd like 
> > > to observe as much data as possible from the current case before 
> > > rebooting. If I could capture some kind of stack trace in the kernel for 
> > > the running process that would be great, any suggestions appreciated.
> > 
> > echo l > /proc/sysrq-trigger
> 
> Despite running this many times, I never 'catch' the process on a CPU, 
> despite it using 70% in top. But...
> 
> > another thing you can do is run something like: perf record -gp $pid
> > which will give you a profile of that task.
> 
> This is very useful, thanks.
> 
> The report on the spinning process (23586) is dominated by calls from 
> mem_cgroup_force_empty.
> 
> It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> the load (I assume drain_all_stock_sync has been optimised out). But I 
> don't think this is as important as what causes the spin.
> 
> There are no tasks in the cgroup, but memory usage is non-zero and 
> constant. It seems mem_cgroup_force_empty is unable to empty the cgroup in 
> this case.
> 
>   # cat /cgroup/soaked-23586/tasks
>   # cat /cgroup/soaked-23586/memory.usage_in_bytes
>   24576
>   # cat /cgroup/soaked-23586/memsw.usage_in_bytes
>   <hangs>
> 
I think this "cat" hang is because of vfs's lock.

Hmm, then either there are pages on the LRU which cannot be moved, or there
is a leak in the accounting.

BTW, mem_cgroup's rmdir is designed to be interruptible by SIGINT etc...
Can you stop the rmdir with Ctrl-C or similar?

  rmdir -> hang -> Ctrl-C (or similar) -> cat .../memory.stat
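
i.e. something like this (a sketch, using the group name from your earlier mail):

  rmdir /cgroup/soaked-23586         # hangs
  # press Ctrl-C to interrupt it
  cat /cgroup/soaked-23586/memory.stat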

Does that work? And are you still using Fedora's kernel?

Thanks,
-Kame

> Here are the first few entries from the perf output, I can provide the 
> rest if needed, but all result from mem_cgroup_force_empty.
> 
>      8.13%   :23586  [kernel]           [k] _raw_spin_lock_irqsave
>              |
>              --- _raw_spin_lock_irqsave
>                 |          
>                 |--45.14%-- probe_workqueue_insertion
>                 |          insert_work
>                 |          |          
>                 |          |--99.09%-- __queue_work
>                 |          |          queue_work_on
>                 |          |          schedule_work_on
>                 |          |          schedule_on_each_cpu
>                 |          |          |          
>                 |          |          |--50.59%-- lru_add_drain_all
>                 |          |          |          mem_cgroup_force_empty
>                 |          |          |          mem_cgroup_pre_destroy
>                 |          |          |          cgroup_rmdir
>                 |          |          |          vfs_rmdir
>                 |          |          |          do_rmdir
>                 |          |          |          sys_rmdir
>                 |          |          |          system_call_fastpath
>                 |          |          |          0x3f504d27d7
>                 |          |          |          0x405687
>                 |          |          |          0x406ef0
>                 |          |          |          0x402f31
>                 |          |          |          0x3f5041eb1d
>                 |          |          |          
>                 |          |           --49.41%-- mem_cgroup_force_empty
>                 |          |                     mem_cgroup_pre_destroy
>                 |          |                     cgroup_rmdir
>                 |          |                     vfs_rmdir
>                 |          |                     do_rmdir
>                 |          |                     sys_rmdir
>                 |          |                     system_call_fastpath
>                 |          |                     0x3f504d27d7
>                 |          |                     0x405687
>                 |          |                     0x406ef0
>                 |          |                     0x402f31
>                 |          |                     0x3f5041eb1d
>                 |           --0.91%-- [...]
>                 |          
>                 |--22.92%-- mem_cgroup_force_empty
>                 |          mem_cgroup_pre_destroy
>                 |          cgroup_rmdir
>                 |          vfs_rmdir
>                 |          do_rmdir
>                 |          sys_rmdir
>                 |          system_call_fastpath
>                 |          0x3f504d27d7
>                 |          0x405687
>                 |          0x406ef0
>                 |          0x402f31
>                 |          0x3f5041eb1d
>                 |          
>                 |--8.17%-- __queue_work
>                 |          queue_work_on
>                 |          schedule_work_on
>                 |          schedule_on_each_cpu
>                 |          |          
>                 |          |--52.09%-- lru_add_drain_all
>                 |          |          mem_cgroup_force_empty
>                 |          |          mem_cgroup_pre_destroy
>                 |          |          cgroup_rmdir
>                 |          |          vfs_rmdir
>                 |          |          do_rmdir
>                 |          |          sys_rmdir
>                 |          |          system_call_fastpath
>                 |          |          0x3f504d27d7
>                 |          |          0x405687
>                 |          |          0x406ef0
>                 |          |          0x402f31
>                 |          |          0x3f5041eb1d
>                 |          |          
>                 |           --47.91%-- mem_cgroup_force_empty
>                 |                     mem_cgroup_pre_destroy
>                 |                     cgroup_rmdir
>                 |                     vfs_rmdir
>                 |                     do_rmdir
>                 |                     sys_rmdir
>                 |                     system_call_fastpath
>                 |                     0x3f504d27d7
>                 |                     0x405687
>                 |                     0x406ef0
>                 |                     0x402f31
>                 |                     0x3f5041eb1d
>                 |          
>                 |--7.94%-- __wake_up
>                 |          |          
>                 |          |--99.71%-- insert_work
>                 |          |          |          
>                 |          |          |--97.70%-- __queue_work
>                 |          |          |          queue_work_on
>                 |          |          |          schedule_work_on
>                 |          |          |          schedule_on_each_cpu
>                 |          |          |          |          
>                 |          |          |          |--50.59%-- mem_cgroup_force_empty
>                 |          |          |          |          mem_cgroup_pre_destroy
>                 |          |          |          |          cgroup_rmdir
>                 |          |          |          |          vfs_rmdir
>                 |          |          |          |          do_rmdir
>                 |          |          |          |          sys_rmdir
>                 |          |          |          |          system_call_fastpath
>                 |          |          |          |          0x3f504d27d7
>                 |          |          |          |          0x405687
>                 |          |          |          |          0x406ef0
>                 |          |          |          |          0x402f31
>                 |          |          |          |          0x3f5041eb1d
>                 |          |          |          |          
>                 |          |          |           --49.41%-- lru_add_drain_all
>                 |          |          |                     mem_cgroup_force_empty
>                 |          |          |                     mem_cgroup_pre_destroy
>                 |          |          |                     cgroup_rmdir
>                 |          |          |                     vfs_rmdir
>                 |          |          |                     do_rmdir
>                 |          |          |                     sys_rmdir
>                 |          |          |                     system_call_fastpath
>                 |          |          |                     0x3f504d27d7
>                 |          |          |                     0x405687
>                 |          |          |                     0x406ef0
>                 |          |          |                     0x402f31
>                 |          |          |                     0x3f5041eb1d
>                 |          |           --2.30%-- [...]
>                 |           --0.29%-- [...]
>                 |          
>                 |--4.35%-- mem_cgroup_pre_destroy
>                 |          cgroup_rmdir
>                 |          vfs_rmdir
>                 |          do_rmdir
>                 |          sys_rmdir
>                 |          system_call_fastpath
>                 |          0x3f504d27d7
>                 |          0x405687
>                 |          0x406ef0
>                 |          0x402f31
>                 |          0x3f5041eb1d
>                  --11.47%-- [...]
> 
>      7.25%   :23586  [kernel]           [k] sched_clock_cpu
>              |
>              --- sched_clock_cpu
>                 |          
>                 |--97.11%-- update_rq_clock
>                 |          |          
>                 |          |--98.89%-- try_to_wake_up
>                 |          |          default_wake_function
>                 |          |          autoremove_wake_function
>                 |          |          __wake_up_common
>                 |          |          __wake_up
>                 |          |          insert_work
>                 |          |          __queue_work
>                 |          |          queue_work_on
>                 |          |          schedule_work_on
>                 |          |          schedule_on_each_cpu
>                 |          |          |          
>                 |          |          |--50.69%-- lru_add_drain_all
>                 |          |          |          mem_cgroup_force_empty
>                 |          |          |          mem_cgroup_pre_destroy
>                 |          |          |          cgroup_rmdir
>                 |          |          |          vfs_rmdir
>                 |          |          |          do_rmdir
>                 |          |          |          sys_rmdir
>                 |          |          |          system_call_fastpath
>                 |          |          |          0x3f504d27d7
>                 |          |          |          0x405687
>                 |          |          |          0x406ef0
>                 |          |          |          0x402f31
>                 |          |          |          0x3f5041eb1d
>                 |          |          |          
>                 |          |           --49.31%-- mem_cgroup_force_empty
>                 |          |                     mem_cgroup_pre_destroy
>                 |          |                     cgroup_rmdir
>                 |          |                     vfs_rmdir
>                 |          |                     do_rmdir
>                 |          |                     sys_rmdir
>                 |          |                     system_call_fastpath
>                 |          |                     0x3f504d27d7
>                 |          |                     0x405687
>                 |          |                     0x406ef0
>                 |          |                     0x402f31
>                 |          |                     0x3f5041eb1d
>                 |           --1.11%-- [...]
>                  --2.89%-- [...]
> 
>      5.54%   :23586  [kernel]           [k] try_to_wake_up
>              |
>              --- try_to_wake_up
>                 |          
>                 |--99.13%-- default_wake_function
>                 |          autoremove_wake_function
>                 |          __wake_up_common
>                 |          __wake_up
>                 |          insert_work
>                 |          __queue_work
>                 |          queue_work_on
>                 |          schedule_work_on
>                 |          schedule_on_each_cpu
>                 |          |          
>                 |          |--52.03%-- lru_add_drain_all
>                 |          |          mem_cgroup_force_empty
>                 |          |          mem_cgroup_pre_destroy
>                 |          |          cgroup_rmdir
>                 |          |          vfs_rmdir
>                 |          |          do_rmdir
>                 |          |          sys_rmdir
>                 |          |          system_call_fastpath
>                 |          |          0x3f504d27d7
>                 |          |          0x405687
>                 |          |          0x406ef0
>                 |          |          0x402f31
>                 |          |          0x3f5041eb1d
>                 |          |          
>                 |           --47.97%-- mem_cgroup_force_empty
>                 |                     mem_cgroup_pre_destroy
>                 |                     cgroup_rmdir
>                 |                     vfs_rmdir
>                 |                     do_rmdir
>                 |                     sys_rmdir
>                 |                     system_call_fastpath
>                 |                     0x3f504d27d7
>                 |                     0x405687
>                 |                     0x406ef0
>                 |                     0x402f31
>                 |                     0x3f5041eb1d
>                  --0.87%-- [...]
> 
> -- 
> Mark
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-09 23:04                     ` Mark Hills
  2010-09-09 23:43                       ` KAMEZAWA Hiroyuki
@ 2010-09-10  2:16                       ` KAMEZAWA Hiroyuki
  2010-09-10  4:05                         ` Daisuke Nishimura
  2010-09-10  7:28                         ` Mark Hills
  1 sibling, 2 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-10  2:16 UTC (permalink / raw)
  To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:
> The report on the spinning process (23586) is dominated by calls from 
> mem_cgroup_force_empty.
> 
> It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> the load (I assume drain_all_stock_sync has been optimised out). But I 
> don't think this is as important as what causes the spin.
> 

I noticed you use FUSE, and it seems there is a problem between FUSE and memcg.
I wrote a patch (against 2.6.36, but it can be applied).

Could you try this? I'm sorry, I don't use a FUSE filesystem myself and can't
test it right now.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The memory cgroup catches every page added to the radix-tree and assumes
that the page will also be put on an LRU somewhere. But some pages end up
on the radix-tree without ever being on an LRU, so force_empty cannot find
them, and ->pre_destroy() and the rmdir operation cannot finish.

This patch adds __GFP_NOMEMCGROUP and avoids registering such
out-of-control pages with the memory cgroup at all.

Note: this gfp flag could also be used for shmem handling, which
      currently uses complicated heuristics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/fuse/dev.c       |   11 ++++++++++-
 include/linux/gfp.h |    7 +++++++
 mm/memcontrol.c     |    2 +-
 3 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/swap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
 	struct pipe_buffer *buf = cs->pipebufs;
 	struct address_space *mapping;
 	pgoff_t index;
+	gfp_t mask = GFP_KERNEL;
 
 	unlock_request(cs->fc, cs->req);
 	fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
 	remove_from_page_cache(oldpage);
 	page_cache_release(oldpage);
 
-	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+	/*
+	 * not-on-LRU pages are out of control. So, add to root cgroup.
+ 	 * See mm/memcontrol.c for details.
+	 */
+	if (buf->flags & PIPE_BUF_FLAG_LRU)
+		mask |= __GFP_NOMEMCGROUP;
+
+	err = add_to_page_cache_locked(newpage, mapping, index, mask);
 	if (err) {
 		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
 		goto out_fallback_unlock;
Index: linux-2.6.36-rc3/include/linux/gfp.h
===================================================================
--- linux-2.6.36-rc3.orig/include/linux/gfp.h
+++ linux-2.6.36-rc3/include/linux/gfp.h
@@ -60,6 +60,13 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)0)
 #endif
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#define __GFP_NOMEMCGROUP	((__force gfp_t)0x400000u)
+	/* Don't track by memory cgroup */
+#else
+#define __GFP_NOMEMCGROUP	((__force gfp_t)0)
+#endif
+
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
Index: linux-2.6.36-rc3/mm/memcontrol.c
===================================================================
--- linux-2.6.36-rc3.orig/mm/memcontrol.c
+++ linux-2.6.36-rc3/mm/memcontrol.c
@@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page 
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
+	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
 		return 0;
 	/*
 	 * Corner case handling. This is called from add_to_page_cache()


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-10  2:16                       ` KAMEZAWA Hiroyuki
@ 2010-09-10  4:05                         ` Daisuke Nishimura
  2010-09-10  4:11                           ` KAMEZAWA Hiroyuki
  2010-09-10  7:28                         ` Mark Hills
  1 sibling, 1 reply; 26+ messages in thread
From: Daisuke Nishimura @ 2010-09-10  4:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mark Hills, Peter Zijlstra, Balbir Singh, linux-kernel,
	Daisuke Nishimura

On Fri, 10 Sep 2010 11:16:46 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
> > The report on the spinning process (23586) is dominated by calls from 
> > mem_cgroup_force_empty.
> > 
> > It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> > the load (I assume drain_all_stock_sync has been optimised out). But I 
> > don't think this is as important as what causes the spin.
> > 
> 
> I noticed you use FUSE and it seems there is a problem in FUSE v.s. memcg.
> I wrote a patch (onto 2.6.36 but can be applied..)
> 
Nice catch!

> Could you try this ? I'm sorry I don't use FUSE system and can't test
> right now.
> 
Sorry, I can't either.

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> memory cgroup catches all pages which is added to radix-tree and
> assumes the pages will be added to LRU, somewhere.
> But there are pages which not on LRU but on radix-tree. Then,
> force_empty cannot find them and cannot finish ->pre_destroy(), rmdir
> operations.
> 
> This patch adds __GFP_NOMEMCGROUP and avoids unnecessary, out-of-control
> pages are registered to memory cgroup. 
> 
> Note: This gfp flag can be used for shmem handling, which now uses
>       complicated heuristics.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  fs/fuse/dev.c       |   11 ++++++++++-
>  include/linux/gfp.h |    7 +++++++
>  mm/memcontrol.c     |    2 +-
>  3 files changed, 18 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.36-rc3/fs/fuse/dev.c
> ===================================================================
> --- linux-2.6.36-rc3.orig/fs/fuse/dev.c
> +++ linux-2.6.36-rc3/fs/fuse/dev.c
> @@ -19,6 +19,7 @@
>  #include <linux/pipe_fs_i.h>
>  #include <linux/swap.h>
>  #include <linux/splice.h>
> +#include <linux/memcontrol.h>
>  
>  MODULE_ALIAS_MISCDEV(FUSE_MINOR);
>  MODULE_ALIAS("devname:fuse");
> @@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
>  	struct pipe_buffer *buf = cs->pipebufs;
>  	struct address_space *mapping;
>  	pgoff_t index;
> +	gfp_t mask = GFP_KERNEL;
>  
>  	unlock_request(cs->fc, cs->req);
>  	fuse_copy_finish(cs);
> @@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
>  	remove_from_page_cache(oldpage);
>  	page_cache_release(oldpage);
>  
> -	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
> +	/*
> +	 * not-on-LRU pages are out of control. So, add to root cgroup.
> + 	 * See mm/memcontrol.c for details.
> +	 */
> +	if (buf->flags & PIPE_BUF_FLAG_LRU)
> +		mask |= __GFP_NOMEMCGROUP;
> +
> +	err = add_to_page_cache_locked(newpage, mapping, index, mask);
>  	if (err) {
>  		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
>  		goto out_fallback_unlock;
> Index: linux-2.6.36-rc3/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.36-rc3.orig/include/linux/gfp.h
> +++ linux-2.6.36-rc3/include/linux/gfp.h
> @@ -60,6 +60,13 @@ struct vm_area_struct;
>  #define __GFP_NOTRACK	((__force gfp_t)0)
>  #endif
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#define __GFP_NOMEMCGROUP	((__force gfp_t)0x400000u)
> +	/* Don't track by memory cgroup */
> +#else
> +#define __GFP_NOMEMCGROUP	((__force gfp_t)0)
> +#endif
> +
>  /*
>   * This may seem redundant, but it's a way of annotating false positives vs.
>   * allocations that simply cannot be supported (e.g. page tables).
> Index: linux-2.6.36-rc3/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.36-rc3.orig/mm/memcontrol.c
> +++ linux-2.6.36-rc3/mm/memcontrol.c
> @@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page 
>  
>  	if (mem_cgroup_disabled())
>  		return 0;
> -	if (PageCompound(page))
> +	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
>  		return 0;
>  	/*
>  	 * Corner case handling. This is called from add_to_page_cache()
> 
The comment above says "not-on-LRU pages are out of control. So, add to root cgroup.",
but this change means that we don't charge these pages at all.

Should it be:

	if (gfp_mask & __GFP_NOMEMCGROUP)
		mm = &init_mm;

so that they are charged to the root cgroup? Or should the comment be changed?
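
That is, instead of not charging at all, something like this in
mem_cgroup_cache_charge() (untested, only to illustrate what I mean):

	if (mem_cgroup_disabled())
		return 0;
	if (PageCompound(page))
		return 0;
	/* account out-of-LRU pages to the root cgroup instead, so a child
	 * group is not left with charges force_empty() cannot drop
	 */
	if (gfp_mask & __GFP_NOMEMCGROUP)
		mm = &init_mm;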


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-10  4:05                         ` Daisuke Nishimura
@ 2010-09-10  4:11                           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-10  4:11 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Mark Hills, Peter Zijlstra, Balbir Singh, linux-kernel

On Fri, 10 Sep 2010 13:05:39 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 10 Sep 2010 11:16:46 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > Mark Hills <mark@pogo.org.uk> wrote:
> > > The report on the spinning process (23586) is dominated by calls from 
> > > mem_cgroup_force_empty.
> > > 
> > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> > > the load (I assume drain_all_stock_sync has been optimised out). But I 
> > > don't think this is as important as what causes the spin.
> > > 
> > 
> > I noticed you use FUSE and it seems there is a problem in FUSE v.s. memcg.
> > I wrote a patch (onto 2.6.36 but can be applied..)
> > 
> Nice catch!
> 
> > Could you try this ? I'm sorry I don't use FUSE system and can't test
> > right now.
> > 
> Sorry, I can't either.
> 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > memory cgroup catches all pages which is added to radix-tree and
> > assumes the pages will be added to LRU, somewhere.
> > But there are pages which not on LRU but on radix-tree. Then,
> > force_empty cannot find them and cannot finish ->pre_destroy(), rmdir
> > operations.
> > 
> > This patch adds __GFP_NOMEMCGROUP and avoids unnecessary, out-of-control
> > pages are registered to memory cgroup. 
> > 
> > Note: This gfp flag can be used for shmem handling, which now uses
> >       complicated heuristics.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  fs/fuse/dev.c       |   11 ++++++++++-
> >  include/linux/gfp.h |    7 +++++++
> >  mm/memcontrol.c     |    2 +-
> >  3 files changed, 18 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6.36-rc3/fs/fuse/dev.c
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/fs/fuse/dev.c
> > +++ linux-2.6.36-rc3/fs/fuse/dev.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/pipe_fs_i.h>
> >  #include <linux/swap.h>
> >  #include <linux/splice.h>
> > +#include <linux/memcontrol.h>
> >  
> >  MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> >  MODULE_ALIAS("devname:fuse");
> > @@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
> >  	struct pipe_buffer *buf = cs->pipebufs;
> >  	struct address_space *mapping;
> >  	pgoff_t index;
> > +	gfp_t mask = GFP_KERNEL;
> >  
> >  	unlock_request(cs->fc, cs->req);
> >  	fuse_copy_finish(cs);
> > @@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
> >  	remove_from_page_cache(oldpage);
> >  	page_cache_release(oldpage);
> >  
> > -	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
> > +	/*
> > +	 * not-on-LRU pages are out of control. So, add to root cgroup.
> > + 	 * See mm/memcontrol.c for details.
> > +	 */
> > +	if (buf->flags & PIPE_BUF_FLAG_LRU)
> > +		mask |= __GFP_NOMEMCGROUP;
> > +
> > +	err = add_to_page_cache_locked(newpage, mapping, index, mask);
> >  	if (err) {
> >  		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
> >  		goto out_fallback_unlock;
> > Index: linux-2.6.36-rc3/include/linux/gfp.h
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/include/linux/gfp.h
> > +++ linux-2.6.36-rc3/include/linux/gfp.h
> > @@ -60,6 +60,13 @@ struct vm_area_struct;
> >  #define __GFP_NOTRACK	((__force gfp_t)0)
> >  #endif
> >  
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +#define __GFP_NOMEMCGROUP	((__force gfp_t)0x400000u)
> > +	/* Don't track by memory cgroup */
> > +#else
> > +#define __GFP_NOMEMCGROUP	((__force gfp_t)0)
> > +#endif
> > +
> >  /*
> >   * This may seem redundant, but it's a way of annotating false positives vs.
> >   * allocations that simply cannot be supported (e.g. page tables).
> > Index: linux-2.6.36-rc3/mm/memcontrol.c
> > ===================================================================
> > --- linux-2.6.36-rc3.orig/mm/memcontrol.c
> > +++ linux-2.6.36-rc3/mm/memcontrol.c
> > @@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page 
> >  
> >  	if (mem_cgroup_disabled())
> >  		return 0;
> > -	if (PageCompound(page))
> > +	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
> >  		return 0;
> >  	/*
> >  	 * Corner case handling. This is called from add_to_page_cache()
> > 
> The comments above says "not-on-LRU pages are out of control. So, add to root cgroup.".
> But this change means that we don't charge these pages at all.
> 
> Should it be:
> 
> 	if (gfp_mask & __GFP_NOMEMCGROUP))
> 		mm = &init_mm;
> 
> ?
> Or, change the comment ?
> 

Yes... the comment is wrong.

Thanks,
-Kame
==

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The memory cgroup catches every page added to the radix-tree and assumes
that the page will also be put on an LRU somewhere. But some pages end up
on the radix-tree without ever being on an LRU, so force_empty cannot find
them, and ->pre_destroy() and the rmdir operation cannot finish.

This patch adds __GFP_NOMEMCGROUP and avoids registering such
out-of-control pages with the memory cgroup at all.

Note: this gfp flag could also be used for shmem handling, which
      currently uses complicated heuristics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/fuse/dev.c       |   11 ++++++++++-
 include/linux/gfp.h |    7 +++++++
 mm/memcontrol.c     |    2 +-
 3 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.36-rc3/fs/fuse/dev.c
===================================================================
--- linux-2.6.36-rc3.orig/fs/fuse/dev.c
+++ linux-2.6.36-rc3/fs/fuse/dev.c
@@ -19,6 +19,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/swap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -683,6 +684,7 @@ static int fuse_try_move_page(struct fus
 	struct pipe_buffer *buf = cs->pipebufs;
 	struct address_space *mapping;
 	pgoff_t index;
+	gfp_t mask = GFP_KERNEL;
 
 	unlock_request(cs->fc, cs->req);
 	fuse_copy_finish(cs);
@@ -732,7 +734,14 @@ static int fuse_try_move_page(struct fus
 	remove_from_page_cache(oldpage);
 	page_cache_release(oldpage);
 
-	err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+	/*
+	 * non-LRU pages are out of cgroup control.
+	 * See mm/memcontrol.c or Documentation/cgroups/memory.txt for details.
+	 */
+	if (buf->flags & PIPE_BUF_FLAG_LRU)
+		mask |= __GFP_NOMEMCGROUP;
+
+	err = add_to_page_cache_locked(newpage, mapping, index, mask);
 	if (err) {
 		printk(KERN_WARNING "fuse_try_move_page: failed to add page");
 		goto out_fallback_unlock;
Index: linux-2.6.36-rc3/include/linux/gfp.h
===================================================================
--- linux-2.6.36-rc3.orig/include/linux/gfp.h
+++ linux-2.6.36-rc3/include/linux/gfp.h
@@ -60,6 +60,13 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)0)
 #endif
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#define __GFP_NOMEMCGROUP	((__force gfp_t)0x400000u)
+	/* Don't track by memory cgroup */
+#else
+#define __GFP_NOMEMCGROUP	((__force gfp_t)0)
+#endif
+
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
Index: linux-2.6.36-rc3/mm/memcontrol.c
===================================================================
--- linux-2.6.36-rc3.orig/mm/memcontrol.c
+++ linux-2.6.36-rc3/mm/memcontrol.c
@@ -2114,7 +2114,7 @@ int mem_cgroup_cache_charge(struct page 
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
+	if (PageCompound(page) || (gfp_mask & __GFP_NOMEMCGROUP))
 		return 0;
 	/*
 	 * Corner case handling. This is called from add_to_page_cache()



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-10  2:16                       ` KAMEZAWA Hiroyuki
  2010-09-10  4:05                         ` Daisuke Nishimura
@ 2010-09-10  7:28                         ` Mark Hills
  2010-09-10  7:33                           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 26+ messages in thread
From: Mark Hills @ 2010-09-10  7:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
> > The report on the spinning process (23586) is dominated by calls from 
> > mem_cgroup_force_empty.
> > 
> > It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> > the load (I assume drain_all_stock_sync has been optimised out). But I 
> > don't think this is as important as what causes the spin.
> > 
> 
> I noticed you use FUSE and it seems there is a problem in FUSE v.s. memcg.
> I wrote a patch (onto 2.6.36 but can be applied..)
> 
> Could you try this ? I'm sorry I don't use FUSE system and can't test
> right now.

What makes you conclude that FUSE is in use? I do not think this is the 
case. Or do you mean that it is a problem that the kernel is built with 
FUSE support?

I _can_ test the patch, but I still cannot reliably reproduce the problem, 
so it will be hard to conclude whether the patch works or not. Is there a 
way to build a test case for this?
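
For reference, the test I have been running is roughly of this shape (a 
simplified sketch; the cgroup mount point, paths and sizes here are only 
illustrative, and it does not trigger the hang reliably):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* write a short string to a cgroup control file */
static void write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, s, strlen(s)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	char pid[32];

	snprintf(pid, sizeof pid, "%d\n", (int)getpid());

	for (;;) {
		/* create a fresh memory cgroup and move this task into it */
		if (mkdir("/cgroup/memory/soak", 0755) != 0 && errno != EEXIST)
			perror("mkdir");
		write_str("/cgroup/memory/soak/tasks", pid);

		/* generate some page cache inside the group, as 'dd' would */
		if (system("dd if=/dev/zero of=/tmp/soak.dat bs=1M count=64 2>/dev/null") != 0)
			fprintf(stderr, "dd failed\n");
		unlink("/tmp/soak.dat");

		/* move back to the root group, then remove the now-empty
		   cgroup; rmdir() is where the hang appears when it occurs */
		write_str("/cgroup/memory/tasks", pid);
		if (rmdir("/cgroup/memory/soak") != 0)
			perror("rmdir");
	}

	return 0;
}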

Thanks for your help

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-10  7:28                         ` Mark Hills
@ 2010-09-10  7:33                           ` KAMEZAWA Hiroyuki
  2010-09-10  7:51                             ` Mark Hills
  0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-10  7:33 UTC (permalink / raw)
  To: Mark Hills; +Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010 08:28:00 +0100 (BST)
Mark Hills <mark@pogo.org.uk> wrote:

> On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
> 
> > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > Mark Hills <mark@pogo.org.uk> wrote:
> > > The report on the spinning process (23586) is dominated by calls from 
> > > mem_cgroup_force_empty.
> > > 
> > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> > > the load (I assume drain_all_stock_sync has been optimised out). But I 
> > > don't think this is as important as what causes the spin.
> > > 
> > 
> > I noticed you use FUSE and it seems there is a problem in FUSE v.s. memcg.
> > I wrote a patch (onto 2.6.36 but can be applied..)
> > 
> > Could you try this ? I'm sorry I don't use FUSE system and can't test
> > right now.
> 
> What makes you conclude that FUSE is in use? I do not think this is the 
> case. Or do you mean that it is a problem that the kernel is built with 
> FUSE support?
> 
You wrote 
> The test case I was running is similar to the above. With the Lustre 
> filesystem the problem takes 4 hours or more to show itself. Recently I 
> ran 4 threads for over 24 hours without it being seen -- I suspect some 
> external factor is involved.

I think the Lustre FS uses FUSE. Am I wrong?


> I _can_ test the patch, but I still cannot reliably reproduce the problem 
> so it will be hard to conclude whether the patch works or not. Is there a 
> way to build a test case for this?
> 

I'm sorry, I'm not sure yet. But from your report, you have 6 pages of charge
which cannot be found by force_empty(). And I found that FUSE's pipe copy code
inserts pages into the page-cache radix-tree without moving them onto the LRU.

So,
  - there are remaining pages which are off the LRU,
  - FUSE's "move" code does something curious: add_to_page_cache() but no LRU, and
  - you reported that you use the Lustre FS.

That is why I am asking you: to test this myself, I would have to study FUSE
and write a test module... Maybe adding a printk() where I added the gfp_mask
modification in fuse/dev.c can show something, though.

We may have some other problem as well, but it seems this is one of them.
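
For the printk() I mean, something like this in fuse_try_move_page()
(untested, just to show where):

	if (buf->flags & PIPE_BUF_FLAG_LRU) {
		mask |= __GFP_NOMEMCGROUP;
		/* debug only: see whether this path is being hit at all */
		printk(KERN_INFO "fuse_try_move_page: PIPE_BUF_FLAG_LRU page at index %lu\n",
		       (unsigned long)index);
	}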

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: cgroup: rmdir() does not complete
  2010-09-10  7:33                           ` KAMEZAWA Hiroyuki
@ 2010-09-10  7:51                             ` Mark Hills
  0 siblings, 0 replies; 26+ messages in thread
From: Mark Hills @ 2010-09-10  7:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 10 Sep 2010 08:28:00 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
> 
> > On Fri, 10 Sep 2010, KAMEZAWA Hiroyuki wrote:
> > 
> > > On Fri, 10 Sep 2010 00:04:31 +0100 (BST)
> > > Mark Hills <mark@pogo.org.uk> wrote:
> > > > The report on the spinning process (23586) is dominated by calls from 
> > > > mem_cgroup_force_empty.
> > > > 
> > > > It seems to show lru_add_drain_all and drain_all_stock_sync are causing 
> > > > the load (I assume drain_all_stock_sync has been optimised out). But I 
> > > > don't think this is as important as what causes the spin.
> > > > 
> > > 
> > > I noticed you use FUSE and it seems there is a problem in FUSE v.s. memcg.
> > > I wrote a patch (onto 2.6.36 but can be applied..)
> > > 
> > > Could you try this ? I'm sorry I don't use FUSE system and can't test
> > > right now.
> > 
> > What makes you conclude that FUSE is in use? I do not think this is the 
> > case. Or do you mean that it is a problem that the kernel is built with 
> > FUSE support?
> > 
> You wrote 
> > The test case I was running is similar to the above. With the Lustre 
> > filesystem the problem takes 4 hours or more to show itself. Recently I 
> > ran 4 threads for over 24 hours without it being seen -- I suspect some 
> > external factor is involved.
> 
> I think Lustre FS is using FUSE. I'm wrong ?

Lustre does not use FUSE. But the client is a set of kernel modules, so 
these could do anything.

> > I _can_ test the patch, but I still cannot reliably reproduce the problem 
> > so it will be hard to conclude whether the patch works or not. Is there a 
> > way to build a test case for this?
> > 
> 
> I'm sorry I'm not sure yet. But from your report, you have 6 pages of charge
> which cannot be found by force_empty(). And I found FUSE's pipe copy code
> inserts a page cache into radix-tree but not move them onto LRU.
> 
> So,
>   - There are remaining pages which is out-of-LRU
>   - FUSE's "move" code does something curious, add_to_page_cache() but not LRU.
>   - You reporeted you use Lustre FS.
> 
> Then, I ask you. To test this, I have to study FUSE to write test module...
> Maybe adding printk() to where I added gfp_mask modification of fuse/dev.c
> can show something but...
> 
> We may have something other problem, but it seems this is one of them.

Okay, it sounds like perhaps I need to investigate Lustre; I will do this 
next week. But I think FUSE can be ruled out.

Thanks again

-- 
Mark

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2010-09-10  7:51 UTC | newest]

Thread overview: 26+ messages
2010-08-26 15:51 cgroup: rmdir() does not complete Mark Hills
2010-08-27  0:56 ` Daisuke Nishimura
2010-08-27  1:20   ` Balbir Singh
2010-08-27  2:35   ` KAMEZAWA Hiroyuki
2010-08-27  3:39     ` Daisuke Nishimura
2010-08-27  5:42       ` KAMEZAWA Hiroyuki
2010-08-27  6:29         ` KAMEZAWA Hiroyuki
2010-08-30  7:32           ` Balbir Singh
2010-08-30  9:13         ` Mark Hills
2010-09-01 11:10         ` Mark Hills
2010-09-01 23:42           ` KAMEZAWA Hiroyuki
2010-09-02  9:45             ` Mark Hills
2010-09-09 10:01             ` Mark Hills
2010-09-09 10:09               ` Balbir Singh
2010-09-09 11:36                 ` Mark Hills
2010-09-09 11:50                   ` Peter Zijlstra
2010-09-09 23:04                     ` Mark Hills
2010-09-09 23:43                       ` KAMEZAWA Hiroyuki
2010-09-10  2:16                       ` KAMEZAWA Hiroyuki
2010-09-10  4:05                         ` Daisuke Nishimura
2010-09-10  4:11                           ` KAMEZAWA Hiroyuki
2010-09-10  7:28                         ` Mark Hills
2010-09-10  7:33                           ` KAMEZAWA Hiroyuki
2010-09-10  7:51                             ` Mark Hills
2010-08-27  1:25 ` KAMEZAWA Hiroyuki
2010-08-30  9:25   ` Mark Hills
