* Please backport commit 3812c8c8f39 to stable
@ 2014-09-30  3:43 Cong Wang
  2014-09-30  8:16 ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2014-09-30  3:43 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Greg KH, LKML, stable, Michal Hocko

Hi, Johannes and Greg


Please consider backporting the following commit to stable kernels < 3.12.

commit 3812c8c8f3953921ef18544110dafc3505c1ac62
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Sep 12 15:13:44 2013 -0700

    mm: memcg: do not trap chargers with full callstack on OOM

It should fix some soft lockups I have observed on different machines
recently. Personally, I only care about 3.10. :-p

Commit 4942642080ea82d99ab5b65 (mm: memcg: handle non-error OOM
situations more gracefully) seems to be needed as a follow-up. Johannes
would know much better than I do.

Let me know if you need any more information or help from me.

Thanks much!


* Re: Please backport commit 3812c8c8f39 to stable
  2014-09-30  3:43 Please backport commit 3812c8c8f39 to stable Cong Wang
@ 2014-09-30  8:16 ` Michal Hocko
  2014-09-30 17:14   ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-09-30  8:16 UTC (permalink / raw)
  To: Cong Wang; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Mon 29-09-14 20:43:46, Cong Wang wrote:
> Hi, Johannes and Greg
> 
> 
> Please consider backporting the following commit to stable kernels < 3.12.
> 
> commit 3812c8c8f3953921ef18544110dafc3505c1ac62
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Thu Sep 12 15:13:44 2013 -0700
> 
>     mm: memcg: do not trap chargers with full callstack on OOM
>
> It should fix some soft lockups I have observed on different machines
> recently. Personally, I only care about 3.10. :-p

Could you be more specific about the soft lockup you are seeing?
-- 
Michal Hocko
SUSE Labs


* Re: Please backport commit 3812c8c8f39 to stable
  2014-09-30  8:16 ` Michal Hocko
@ 2014-09-30 17:14   ` Cong Wang
  2014-10-02 21:04     ` Cong Wang
  2014-10-03 15:13     ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Cong Wang @ 2014-09-30 17:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Tue, Sep 30, 2014 at 1:16 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Mon 29-09-14 20:43:46, Cong Wang wrote:
>> Hi, Johannes and Greg
>>
>>
>> Please consider backporting the following commit to stable kernels < 3.12.
>>
>> commit 3812c8c8f3953921ef18544110dafc3505c1ac62
>> Author: Johannes Weiner <hannes@cmpxchg.org>
>> Date:   Thu Sep 12 15:13:44 2013 -0700
>>
>>     mm: memcg: do not trap chargers with full callstack on OOM
>>
>> It should fix some soft lockups I have observed on different machines
>> recently. Personally, I only care about 3.10. :-p
>
> Could you be more specific about the soft lockup you are seeing?
>

Sure, it is almost the same as the one in that changelog, which is why
I didn't provide it in my previous email. See the bottom of this email
for details.

Note, I am not entirely sure the deadlock is caused by the OOM killer
trying to kill a task that is sleeping on an inode mutex; it may also be
that the OOM killer failed to kill some frozen process, since I saw many
processes in the frozen state. If that is the case, we will need my
patch: https://lkml.org/lkml/2014/9/4/646.
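
To make the suspected scenario concrete, here is a rough userspace
analogue of the deadlock shape (illustration only; flock and two
subshells stand in for i_mutex, the charging task and the OOM victim,
and all file names below are made up):

#!/bin/bash
# Illustration only, not kernel code: the "charger" takes an exclusive
# lock (standing in for i_mutex) and then waits for the "victim" process
# to exit, much like the memcg charge path waits for the OOM-killed task
# to release memory; the victim is blocked on that same lock, so it can
# never exit and neither side makes progress.

LOCKFILE=$(mktemp /tmp/fake_i_mutex.XXXXXX)
PIDFILE=/tmp/fake_victim.pid

# "charger": takes the lock first, then waits for the victim to go away
(
    flock -x 9
    sleep 2                               # let the victim block on the lock
    VICTIM=$(cat $PIDFILE)
    echo "charger: holding the lock, waiting for victim $VICTIM to exit"
    while kill -0 $VICTIM 2>/dev/null; do # never finishes: the victim
        sleep 1                           # needs our lock before it can exit
    done
) 9>$LOCKFILE &

# "victim": the task picked to die; it needs the lock to finish and exit
(
    echo $BASHPID > $PIDFILE
    sleep 1                               # make sure the charger wins the lock
    flock -x 9                            # blocks forever
    echo "victim: got the lock (never printed)"
) 9>$LOCKFILE &

wait    # hangs: this is the deadlock shape described above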

But anyway, that commit definitely fixes some real soft lockups, so it
could be a stable candidate even though it is a large one. I am willing
to help if needed.

Thanks!

---------------------->

[8073927.905238] INFO: task mesos-slave:10041 blocked for more than 120 seconds.
[8073927.905241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[8073927.905243] mesos-slave     D ffff88081bf46060     0 10041  10030
0x00000000
[8073927.905247]  ffff8808208bddb8 0000000000000082 ffff8808545e2e40
ffff8808208bdfd8
[8073927.905252]  ffff8808208bdfd8 0000000000012a00 ffff88081bf45c80
ffff88081bf45c80
[8073927.905255]  ffff880da4351f94 ffff880da4351f90 ffff880da4351f98
0000000000000000
[8073927.905258] Call Trace:
[8073927.905267]  [<ffffffff814a40a6>] schedule+0x69/0x6b
[8073927.905270]  [<ffffffff814a4484>] schedule_preempt_disabled+0xe/0x10
[8073927.905273]  [<ffffffff814a3287>] __mutex_lock_common.isra.9+0x148/0x1d6
[8073927.905278]  [<ffffffff811ec271>] ? security_inode_permission+0x1c/0x21
[8073927.905281]  [<ffffffff814a3401>] __mutex_lock_slowpath+0x13/0x15
[8073927.905284]  [<ffffffff814a3102>] mutex_lock+0x1f/0x2f
[8073927.905287]  [<ffffffff81134f3c>] vfs_unlink+0x44/0xb7
[8073927.905289]  [<ffffffff8113508b>] do_unlinkat+0xdc/0x17f
[8073927.905292]  [<ffffffff814a426d>] ? _cond_resched+0xe/0x1e
[8073927.905295]  [<ffffffff81058252>] ? task_work_run+0x82/0x94
[8073927.905300]  [<ffffffff81002811>] ? do_notify_resume+0x57/0x65
[8073927.905303]  [<ffffffff81135ca4>] SyS_unlink+0x16/0x18
[8073927.905307]  [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f


sysrq-t output:

[8821221.981672] mesos-slave     D ffff88081bf46060     0 10041  10030
0x00000000
[8821221.981674]  ffff8808208bddb8 0000000000000082 ffff8808545e2e40
ffff8808208bdfd8
[8821221.981677]  ffff8808208bdfd8 0000000000012a00 ffff88081bf45c80
ffff88081bf45c80
[8821221.981679]  ffff880da4351f94 ffff880da4351f90 ffff880da4351f98
0000000000000000
[8821221.981682] Call Trace:
[8821221.981685]  [<ffffffff814a40a6>] schedule+0x69/0x6b
[8821221.981687]  [<ffffffff814a4484>] schedule_preempt_disabled+0xe/0x10
[8821221.981690]  [<ffffffff814a3287>] __mutex_lock_common.isra.9+0x148/0x1d6
[8821221.981693]  [<ffffffff811ec271>] ? security_inode_permission+0x1c/0x21
[8821221.981696]  [<ffffffff814a3401>] __mutex_lock_slowpath+0x13/0x15
[8821221.981698]  [<ffffffff814a3102>] mutex_lock+0x1f/0x2f
[8821221.981701]  [<ffffffff81134f3c>] vfs_unlink+0x44/0xb7
[8821221.981703]  [<ffffffff8113508b>] do_unlinkat+0xdc/0x17f
[8821221.981705]  [<ffffffff814a426d>] ? _cond_resched+0xe/0x1e
[8821221.981707]  [<ffffffff81058252>] ? task_work_run+0x82/0x94
[8821221.981711]  [<ffffffff81002811>] ? do_notify_resume+0x57/0x65
[8821221.981714]  [<ffffffff81135ca4>] SyS_unlink+0x16/0x18
[8821221.981716]  [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f
[...]
[8821221.986069] python2.6       D ffff881054386060     0 41843  10049
0x00000004
[8821221.986071]  ffff8809677f5930 0000000000000082 ffff880eedf0ae40
ffff8809677f5fd8
[8821221.986074]  ffff8809677f5fd8 0000000000012a00 ffff881054385c80
000000030d5d1e86
[8821221.986077]  ffff88084ea73000 ffff88084ea73000 ffff88041ef49720
ffff88084ea73000
[8821221.986080] Call Trace:
[8821221.986082]  [<ffffffff814a40a6>] schedule+0x69/0x6b
[8821221.986085]  [<ffffffff814a2e24>] schedule_timeout+0xf3/0x129
[8821221.986087]  [<ffffffff810499ce>] ? __internal_add_timer+0xb6/0xb6
[8821221.986090]  [<ffffffff814a2eb8>]
schedule_timeout_uninterruptible+0x1e/0x20
[8821221.986092]  [<ffffffff811232d3>] __mem_cgroup_try_charge+0x3ea/0x8ff
[8821221.986095]  [<ffffffff81122d7a>] ? mem_cgroup_reclaim+0xb2/0xb2
[8821221.986097]  [<ffffffff81123c2a>] mem_cgroup_charge_common+0x35/0x5d
[8821221.986100]  [<ffffffff811250aa>] mem_cgroup_cache_charge+0x51/0x81
[8821221.986103]  [<ffffffff810e237d>] add_to_page_cache_locked+0x3b/0x104
[8821221.986106]  [<ffffffff810e245e>] add_to_page_cache_lru+0x18/0x39
[8821221.986110]  [<ffffffff810e278a>] grab_cache_page_write_begin+0x87/0xb7
[8821221.986113]  [<ffffffff81190c20>] ext4_write_begin+0xef/0x28b
[8821221.986116]  [<ffffffff810e1adc>] generic_file_buffered_write+0xfd/0x20c
[8821221.986119]  [<ffffffff8113c8fb>] ? update_time+0xa2/0xa9
[8821221.986122]  [<ffffffff810e3375>] __generic_file_aio_write+0x1c0/0x1f8
[8821221.986124]  [<ffffffff810e3408>] generic_file_aio_write+0x5b/0xa9
[8821221.986127]  [<ffffffff8118951f>] ext4_file_write+0x2e5/0x376
[8821221.986129]  [<ffffffff8100665c>] ? emulate_vsyscall+0x212/0x2f6
[8821221.986132]  [<ffffffff8149ae9f>] ? __bad_area_nosemaphore+0xb4/0x1bf
[8821221.986135]  [<ffffffff81128ee1>] do_sync_write+0x68/0x95
[8821221.986138]  [<ffffffff81129566>] vfs_write+0xb2/0x117
[8821221.986141]  [<ffffffff81129bb9>] SyS_write+0x46/0x74
[8821221.986144]  [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f


* Re: Please backport commit 3812c8c8f39 to stable
  2014-09-30 17:14   ` Cong Wang
@ 2014-10-02 21:04     ` Cong Wang
  2014-10-03 15:37       ` Michal Hocko
  2014-10-03 15:13     ` Michal Hocko
  1 sibling, 1 reply; 12+ messages in thread
From: Cong Wang @ 2014-10-02 21:04 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Greg KH, LKML, stable

Hello again,

I realized it is actually a series of patches:

3812c8c8f3953921ef18544110dafc3505c1ac62 mm: memcg: do not trap
chargers with full callstack on OOM
fb2a6fc56be66c169f8b80e07ed999ba453a2db2 mm: memcg: rework and
document OOM waiting and wakeup
519e52473ebe9db5cdef44670d5a97f1fd53d721 mm: memcg: enable memcg OOM
killer only for user faults
3a13c4d761b4b979ba8767f42345fed3274991b0 x86: finish user fault error
path with fatal signal
759496ba6407c6994d6a5ce3a5e74937d7816208 arch: mm: pass userspace
fault flag to generic fault handler
871341023c771ad233620b7a1fb3d9c7031c4e5c arch: mm: do not invoke OOM
killer on kernel fault OOM
94bce453c78996cc4373d5da6cfabe07fcc6d9f9 arch: mm: remove obsolete
init OOM protection

I am not sure if they have more dependencies.
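
In case it is useful, this is roughly how I would try applying them (a
sketch only: it assumes a clone of the stable tree that also has the
mainline commits available, the branch names are made up, the patches
are applied oldest first, which should be the reverse of the order
listed above, and conflicts will most likely need fixing up by hand on
3.10):

git checkout -b memcg-oom-3.10-backport origin/linux-3.10.y
# apply the series oldest first, recording the upstream IDs with -x;
# stop at the first conflict so it can be resolved manually
for c in \
    94bce453c78996cc4373d5da6cfabe07fcc6d9f9 \
    871341023c771ad233620b7a1fb3d9c7031c4e5c \
    759496ba6407c6994d6a5ce3a5e74937d7816208 \
    3a13c4d761b4b979ba8767f42345fed3274991b0 \
    519e52473ebe9db5cdef44670d5a97f1fd53d721 \
    fb2a6fc56be66c169f8b80e07ed999ba453a2db2 \
    3812c8c8f3953921ef18544110dafc3505c1ac62
do
    git cherry-pick -x $c || break
done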

However, this bug is *fairly* easy to reproduce on 3.10, just using the
following script:

#!/bin/bash

TEST_DIR=/tmp/cgroup_test
[ -d $TEST_DIR ] || mkdir -p $TEST_DIR
mount -t cgroup none $TEST_DIR -o memory
mkdir $TEST_DIR/test
echo 512k > $TEST_DIR/test/memory.limit_in_bytes
dd if=/dev/zero of=/tmp/oom_test_big_file bs=512 count=20000000 &
echo $! > $TEST_DIR/test/tasks
rm -f /tmp/oom_test_big_file
umount $TEST_DIR


Run it like this:

for i in `seq 1 1000`; do ./oom_hung.sh ; done
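
One way to watch for the hang while the loop runs (a hypothetical
helper, not part of the original report; "dmesg -w" needs a reasonably
recent util-linux):

# Optional: lower the hung-task timeout so the warnings show up sooner,
# and keep the kernel log so any hung-task or OOM killer report is saved.
echo 30 > /proc/sys/kernel/hung_task_timeout_secs
dmesg -w > /tmp/oom_hung.dmesg &
for i in `seq 1 1000`; do ./oom_hung.sh ; done
grep -nE 'blocked for more than|invoked oom-killer' /tmp/oom_hung.dmesg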


So please consider this seriously. :)

Thanks!


* Re: Please backport commit 3812c8c8f39 to stable
  2014-09-30 17:14   ` Cong Wang
  2014-10-02 21:04     ` Cong Wang
@ 2014-10-03 15:13     ` Michal Hocko
  2014-10-03 18:03       ` Cong Wang
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-10-03 15:13 UTC (permalink / raw)
  To: Cong Wang; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Tue 30-09-14 10:14:08, Cong Wang wrote:
> On Tue, Sep 30, 2014 at 1:16 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Mon 29-09-14 20:43:46, Cong Wang wrote:
> >> Hi, Johannes and Greg
> >>
> >>
> >> Please consider backporting the following commit to stable kernels < 3.12.
> >>
> >> commit 3812c8c8f3953921ef18544110dafc3505c1ac62
> >> Author: Johannes Weiner <hannes@cmpxchg.org>
> >> Date:   Thu Sep 12 15:13:44 2013 -0700
> >>
> >>     mm: memcg: do not trap chargers with full callstack on OOM
> >>
> >> It should fix some soft lockups I have observed on different machines
> >> recently. Personally, I only care about 3.10. :-p
> >
> > Could you be more specific about the soft lockup you are seeing?
> >
> 
> Sure, it is almost the same as the one in that changelog, which is why
> I didn't provide it in my previous email. See the bottom of this email
> for details.
>
> Note, I am not entirely sure the deadlock is caused by the OOM killer
> trying to kill a task that is sleeping on an inode mutex; it may also
> be that the OOM killer failed to kill some frozen process, since I saw
> many processes in the frozen state. If that is the case, we will need
> my patch: https://lkml.org/lkml/2014/9/4/646.
>
> But anyway, that commit definitely fixes some real soft lockups,

That commit fixes an OOM deadlock, not a soft lockup. Do you have the
OOM killer report from the log? This would tell us whether the killed
task was indeed sleeping on the lock held by the charger which triggered
the OOM. I am a little bit surprised that I do not see any OOM-related
functions on the stacks (maybe the code is inlined...).

It would be better to know what exactly is going on before backporting
this change because it is quite large.

> so it could be a stable candidate even though it is a large one.
> I am willing to help if needed.
> 
> Thanks!
> 
> ---------------------->
> 
> [8073927.905238] INFO: task mesos-slave:10041 blocked for more than 120 seconds.
> [8073927.905241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [8073927.905243] mesos-slave     D ffff88081bf46060     0 10041  10030
> 0x00000000
> [8073927.905247]  ffff8808208bddb8 0000000000000082 ffff8808545e2e40
> ffff8808208bdfd8
> [8073927.905252]  ffff8808208bdfd8 0000000000012a00 ffff88081bf45c80
> ffff88081bf45c80
> [8073927.905255]  ffff880da4351f94 ffff880da4351f90 ffff880da4351f98
> 0000000000000000
> [8073927.905258] Call Trace:
> [8073927.905267]  [<ffffffff814a40a6>] schedule+0x69/0x6b
> [8073927.905270]  [<ffffffff814a4484>] schedule_preempt_disabled+0xe/0x10
> [8073927.905273]  [<ffffffff814a3287>] __mutex_lock_common.isra.9+0x148/0x1d6
> [8073927.905278]  [<ffffffff811ec271>] ? security_inode_permission+0x1c/0x21
> [8073927.905281]  [<ffffffff814a3401>] __mutex_lock_slowpath+0x13/0x15
> [8073927.905284]  [<ffffffff814a3102>] mutex_lock+0x1f/0x2f
> [8073927.905287]  [<ffffffff81134f3c>] vfs_unlink+0x44/0xb7
> [8073927.905289]  [<ffffffff8113508b>] do_unlinkat+0xdc/0x17f
> [8073927.905292]  [<ffffffff814a426d>] ? _cond_resched+0xe/0x1e
> [8073927.905295]  [<ffffffff81058252>] ? task_work_run+0x82/0x94
> [8073927.905300]  [<ffffffff81002811>] ? do_notify_resume+0x57/0x65
> [8073927.905303]  [<ffffffff81135ca4>] SyS_unlink+0x16/0x18
> [8073927.905307]  [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f
> 
> 
> sysrq-t output:
> 
> [8821221.981672] mesos-slave     D ffff88081bf46060     0 10041  10030
> 0x00000000
> [8821221.981674]  ffff8808208bddb8 0000000000000082 ffff8808545e2e40
> ffff8808208bdfd8
> [8821221.981677]  ffff8808208bdfd8 0000000000012a00 ffff88081bf45c80
> ffff88081bf45c80
> [8821221.981679]  ffff880da4351f94 ffff880da4351f90 ffff880da4351f98
> 0000000000000000
> [8821221.981682] Call Trace:
> [8821221.981685]  [<ffffffff814a40a6>] schedule+0x69/0x6b
> [8821221.981687]  [<ffffffff814a4484>] schedule_preempt_disabled+0xe/0x10
> [8821221.981690]  [<ffffffff814a3287>] __mutex_lock_common.isra.9+0x148/0x1d6
> [8821221.981693]  [<ffffffff811ec271>] ? security_inode_permission+0x1c/0x21
> [8821221.981696]  [<ffffffff814a3401>] __mutex_lock_slowpath+0x13/0x15
> [8821221.981698]  [<ffffffff814a3102>] mutex_lock+0x1f/0x2f
> [8821221.981701]  [<ffffffff81134f3c>] vfs_unlink+0x44/0xb7
> [8821221.981703]  [<ffffffff8113508b>] do_unlinkat+0xdc/0x17f
> [8821221.981705]  [<ffffffff814a426d>] ? _cond_resched+0xe/0x1e
> [8821221.981707]  [<ffffffff81058252>] ? task_work_run+0x82/0x94
> [8821221.981711]  [<ffffffff81002811>] ? do_notify_resume+0x57/0x65
> [8821221.981714]  [<ffffffff81135ca4>] SyS_unlink+0x16/0x18
> [8821221.981716]  [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f
> [...]
> [8821221.986069] python2.6       D ffff881054386060     0 41843  10049
> 0x00000004
> [8821221.986071]  ffff8809677f5930 0000000000000082 ffff880eedf0ae40
> ffff8809677f5fd8
> [8821221.986074]  ffff8809677f5fd8 0000000000012a00 ffff881054385c80
> 000000030d5d1e86
> [8821221.986077]  ffff88084ea73000 ffff88084ea73000 ffff88041ef49720
> ffff88084ea73000
> [8821221.986080] Call Trace:
> [8821221.986082]  [<ffffffff814a40a6>] schedule+0x69/0x6b
> [8821221.986085]  [<ffffffff814a2e24>] schedule_timeout+0xf3/0x129
> [8821221.986087]  [<ffffffff810499ce>] ? __internal_add_timer+0xb6/0xb6
> [8821221.986090]  [<ffffffff814a2eb8>]
> schedule_timeout_uninterruptible+0x1e/0x20
> [8821221.986092]  [<ffffffff811232d3>] __mem_cgroup_try_charge+0x3ea/0x8ff
> [8821221.986095]  [<ffffffff81122d7a>] ? mem_cgroup_reclaim+0xb2/0xb2
> [8821221.986097]  [<ffffffff81123c2a>] mem_cgroup_charge_common+0x35/0x5d
> [8821221.986100]  [<ffffffff811250aa>] mem_cgroup_cache_charge+0x51/0x81
> [8821221.986103]  [<ffffffff810e237d>] add_to_page_cache_locked+0x3b/0x104
> [8821221.986106]  [<ffffffff810e245e>] add_to_page_cache_lru+0x18/0x39
> [8821221.986110]  [<ffffffff810e278a>] grab_cache_page_write_begin+0x87/0xb7
> [8821221.986113]  [<ffffffff81190c20>] ext4_write_begin+0xef/0x28b
> [8821221.986116]  [<ffffffff810e1adc>] generic_file_buffered_write+0xfd/0x20c
> [8821221.986119]  [<ffffffff8113c8fb>] ? update_time+0xa2/0xa9
> [8821221.986122]  [<ffffffff810e3375>] __generic_file_aio_write+0x1c0/0x1f8
> [8821221.986124]  [<ffffffff810e3408>] generic_file_aio_write+0x5b/0xa9
> [8821221.986127]  [<ffffffff8118951f>] ext4_file_write+0x2e5/0x376
> [8821221.986129]  [<ffffffff8100665c>] ? emulate_vsyscall+0x212/0x2f6
> [8821221.986132]  [<ffffffff8149ae9f>] ? __bad_area_nosemaphore+0xb4/0x1bf
> [8821221.986135]  [<ffffffff81128ee1>] do_sync_write+0x68/0x95
> [8821221.986138]  [<ffffffff81129566>] vfs_write+0xb2/0x117
> [8821221.986141]  [<ffffffff81129bb9>] SyS_write+0x46/0x74
> [8821221.986144]  [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f

-- 
Michal Hocko
SUSE Labs


* Re: Please backport commit 3812c8c8f39 to stable
  2014-10-02 21:04     ` Cong Wang
@ 2014-10-03 15:37       ` Michal Hocko
  2014-10-03 18:16         ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-10-03 15:37 UTC (permalink / raw)
  To: Cong Wang; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Thu 02-10-14 14:04:08, Cong Wang wrote:
> Hello again,
> 
> I realized it is actually a series of patches:
> 
> 3812c8c8f3953921ef18544110dafc3505c1ac62 mm: memcg: do not trap
> chargers with full callstack on OOM
> fb2a6fc56be66c169f8b80e07ed999ba453a2db2 mm: memcg: rework and
> document OOM waiting and wakeup
> 519e52473ebe9db5cdef44670d5a97f1fd53d721 mm: memcg: enable memcg OOM
> killer only for user faults
> 3a13c4d761b4b979ba8767f42345fed3274991b0 x86: finish user fault error
> path with fatal signal
> 759496ba6407c6994d6a5ce3a5e74937d7816208 arch: mm: pass userspace
> fault flag to generic fault handler
> 871341023c771ad233620b7a1fb3d9c7031c4e5c arch: mm: do not invoke OOM
> killer on kernel fault OOM
> 94bce453c78996cc4373d5da6cfabe07fcc6d9f9 arch: mm: remove obsolete
> init OOM protection

Yes, that looks like the full series.

> I am not sure if they have more dependencies.
> 
> However, this bug is *fairly* easy to reproduce on 3.10, just using the
> following script:
> 
> #!/bin/bash
> 
> TEST_DIR=/tmp/cgroup_test
> [ -d $TEST_DIR ] || mkdir -p $TEST_DIR
> mount -t cgroup none $TEST_DIR -o memory
> mkdir $TEST_DIR/test
> echo 512k > $TEST_DIR/test/memory.limit_in_bytes

This is just insane. You allow only 128 pages to be charged and the
reclaim will have to constantly wait for each page to finish the
writeback.

> dd if=/dev/zero of=/tmp/oom_test_big_file bs=512 count=20000000 &
> echo $! > $TEST_DIR/test/tasks
> rm -f /tmp/oom_test_big_file
> umount $TEST_DIR
> 
> 
> Run it like this:
> 
> for i in `seq 1 1000`; do ./oom_hung.sh ; done

OK, so you will eventually deplete the limit with anon charges if the
pid makes it into the group before dd allocates its 512B buffer (which
will end up consuming a full page anyway). So the OOM is pretty much
unavoidable. All the tasks will have minimal RSS, so it is just a matter
of luck which one gets killed. But this alone shouldn't cause a
deadlock. Are you really sure this is the same issue discussed in the
mentioned patch?

> So please consider this seriously. :)

The bug has been there since the memory controller was introduced, yet
we have only had a single report of it happening in real life. So I do
not think this is that urgent. It was definitely not a good design
decision that the OOM killer was handled on top of unknown locks which
might prevent forward progress. No question about that. Do you see the
problem somewhere in real life? Because, to be honest, the test case is
pretty much insane.
-- 
Michal Hocko
SUSE Labs


* Re: Please backport commit 3812c8c8f39 to stable
  2014-10-03 15:13     ` Michal Hocko
@ 2014-10-03 18:03       ` Cong Wang
  2014-10-07 12:23         ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2014-10-03 18:03 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Fri, Oct 3, 2014 at 8:13 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> That commit fixes an OOM deadlock, not a soft lockup. Do you have the
> OOM killer report from the log? This would tell us whether the killed
> task was indeed sleeping on the lock held by the charger which
> triggered the OOM. I am a little bit surprised that I do not see any
> OOM-related functions on the stacks (maybe the code is inlined...).


Oh, did you see that __mem_cgroup_try_charge() calls
schedule_timeout_uninterruptible() in the stack trace? Yes, they are
inlined, and I don't see any other possibility for calling it.

>
> It would be better to know what exactly is going on before backporting
> this change because it is quite large.
>

I thought the stack trace I showed was obvious. :) I am very happy to
investigate further if you see any other path calling
schedule_timeout_uninterruptible() in __mem_cgroup_try_charge().


* Re: Please backport commit 3812c8c8f39 to stable
  2014-10-03 15:37       ` Michal Hocko
@ 2014-10-03 18:16         ` Cong Wang
  2014-10-03 20:36           ` Cong Wang
  2014-10-07 12:33           ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Cong Wang @ 2014-10-03 18:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Fri, Oct 3, 2014 at 8:37 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 02-10-14 14:04:08, Cong Wang wrote:
>> Hello again,
>>
>> I realized it is actually a series of patches:
>>
>> 3812c8c8f3953921ef18544110dafc3505c1ac62 mm: memcg: do not trap
>> chargers with full callstack on OOM
>> fb2a6fc56be66c169f8b80e07ed999ba453a2db2 mm: memcg: rework and
>> document OOM waiting and wakeup
>> 519e52473ebe9db5cdef44670d5a97f1fd53d721 mm: memcg: enable memcg OOM
>> killer only for user faults
>> 3a13c4d761b4b979ba8767f42345fed3274991b0 x86: finish user fault error
>> path with fatal signal
>> 759496ba6407c6994d6a5ce3a5e74937d7816208 arch: mm: pass userspace
>> fault flag to generic fault handler
>> 871341023c771ad233620b7a1fb3d9c7031c4e5c arch: mm: do not invoke OOM
>> killer on kernel fault OOM
>> 94bce453c78996cc4373d5da6cfabe07fcc6d9f9 arch: mm: remove obsolete
>> init OOM protection
>
> Yes, that looks like the full series.
>
>> I am not sure if they have more dependencies.
>>
>> However, this bug is *fairly* easy to reproduce on 3.10, just using the
>> following script:
>>
>> #!/bin/bash
>>
>> TEST_DIR=/tmp/cgroup_test
>> [ -d $TEST_DIR ] || mkdir -p $TEST_DIR
>> mount -t cgroup none $TEST_DIR -o memory
>> mkdir $TEST_DIR/test
>> echo 512k > $TEST_DIR/test/memory.limit_in_bytes
>
> This is just insane. You allow only 128 pages to be charged and the
> reclaim will have to constantly wait for each page to finish the
> writeback.

This is a test case used ONLY to reproduce this bug; why does it have
to be sane? :)

On the other hand, no matter how insane a test case is, as long as it
triggers hung tasks in the kernel, it is a kernel bug that needs to be
fixed.

>
>> dd if=/dev/zero of=/tmp/oom_test_big_file bs=512 count=20000000 &
>> echo $! > $TEST_DIR/test/tasks
>> rm -f /tmp/oom_test_big_file
>> umount $TEST_DIR
>>
>>
>> Run it like this:
>>
>> for i in `seq 1 1000`; do ./oom_hung.sh ; done
>
> OK, so you will eventually deplete the limit with anon charges if the
> pid makes it into the group before dd allocates its 512B buffer (which
> will end up consuming a full page anyway). So the OOM is pretty much
> unavoidable. All the tasks will have minimal RSS, so it is just a
> matter of luck which one gets killed. But this alone shouldn't cause a
> deadlock. Are you really sure this is the same issue discussed in the
> mentioned patch?

Why not? The OOM killer tries to kill a process that is sleeping on a
mutex already held by the task which triggered the OOM, so why wouldn't
that be a deadlock? Given that both cases show lots of tasks hung on an
inode mutex because of OOM, I am 90% sure they are the same.


>
>> So please consider this seriously. :)
>
> The bug has been there since the memory controller was introduced, yet
> we have only had a single report of it happening in real life. So I do
> not think this is that urgent. It was definitely not a good design
> decision that the OOM killer was handled on top of unknown locks which
> might prevent forward progress. No question about that. Do you see the
> problem somewhere in real life? Because, to be honest, the test case is
> pretty much insane.

I am sorry if I gave the impression that it is the above test case
which caused this bug. No, we saw this bug in *production* in our data
center; it happened on 30+ machines!! :) The insane test case above is
ONLY meant to draw your attention to how serious the bug is, nothing
else.

BTW, I don't spend my working time debugging problems that don't exist
in the real world; this is a real-world bug from our data center.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Please backport commit 3812c8c8f39 to stable
  2014-10-03 18:16         ` Cong Wang
@ 2014-10-03 20:36           ` Cong Wang
  2014-10-07 12:33           ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Cong Wang @ 2014-10-03 20:36 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Fri, Oct 3, 2014 at 11:16 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Why not? The OOM killer tries to kill a process that is sleeping on a
> mutex already held by the task which triggered the OOM, so why wouldn't
> that be a deadlock? Given that both cases show lots of tasks hung on an
> inode mutex because of OOM, I am 90% sure they are the same.
>

I backported these patches to 3.10 and can't reproduce the bug with the
script any more.


* Re: Please backport commit 3812c8c8f39 to stable
  2014-10-03 18:03       ` Cong Wang
@ 2014-10-07 12:23         ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2014-10-07 12:23 UTC (permalink / raw)
  To: Cong Wang; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Fri 03-10-14 11:03:30, Cong Wang wrote:
> On Fri, Oct 3, 2014 at 8:13 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > That commit fixes an OOM deadlock, not a soft lockup. Do you have the
> > OOM killer report from the log? This would tell us whether the killed
> > task was indeed sleeping on the lock held by the charger which
> > triggered the OOM. I am a little bit surprised that I do not see any
> > OOM-related functions on the stacks (maybe the code is inlined...).
> 
> 
> Oh, did you see that __mem_cgroup_try_charge() calls
> schedule_timeout_uninterruptible() in the stack trace? Yes, they are
> inlined, and I don't see any other possibility for calling it.

Yes, the only place we call schedule_timeout_uninterruptible() from is
mem_cgroup_handle_oom(), and it happens only for a task which hasn't
been killed by the OOM killer.

> > It would be better to know what exactly is going on before backporting
> > this change because it is quite large.
> >
> 
> I thought the stack trace I showed was obvious. :) I am very happy to
> investigate further if you see any other path calling
> schedule_timeout_uninterruptible() in __mem_cgroup_try_charge().

I was expecting an OOM report showing that the killed task was sleeping
on a lock held on the way into the charge function. Your report
mentioned a task waiting on i_mutex for too long. It is true that the
charging path holds an i_mutex as well, so it might be the same
situation handled by the said patch. But it is not 100% clear this is
the case without an OOM report which would point to the waiting task.
The memcg might also just be thrashing on its hard limit, with reclaim
taking a long time.

-- 
Michal Hocko
SUSE Labs


* Re: Please backport commit 3812c8c8f39 to stable
  2014-10-03 18:16         ` Cong Wang
  2014-10-03 20:36           ` Cong Wang
@ 2014-10-07 12:33           ` Michal Hocko
  2014-10-09  4:56             ` Cong Wang
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2014-10-07 12:33 UTC (permalink / raw)
  To: Cong Wang; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Fri 03-10-14 11:16:31, Cong Wang wrote:
> On Fri, Oct 3, 2014 at 8:37 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Thu 02-10-14 14:04:08, Cong Wang wrote:
> >> Hello again,
> >>
> >> I realized it is actually a series of patches:
> >>
> >> 3812c8c8f3953921ef18544110dafc3505c1ac62 mm: memcg: do not trap
> >> chargers with full callstack on OOM
> >> fb2a6fc56be66c169f8b80e07ed999ba453a2db2 mm: memcg: rework and
> >> document OOM waiting and wakeup
> >> 519e52473ebe9db5cdef44670d5a97f1fd53d721 mm: memcg: enable memcg OOM
> >> killer only for user faults
> >> 3a13c4d761b4b979ba8767f42345fed3274991b0 x86: finish user fault error
> >> path with fatal signal
> >> 759496ba6407c6994d6a5ce3a5e74937d7816208 arch: mm: pass userspace
> >> fault flag to generic fault handler
> >> 871341023c771ad233620b7a1fb3d9c7031c4e5c arch: mm: do not invoke OOM
> >> killer on kernel fault OOM
> >> 94bce453c78996cc4373d5da6cfabe07fcc6d9f9 arch: mm: remove obsolete
> >> init OOM protection
> >
> > Yes, that looks like the full series.
> >
> >> I am not sure if they have more dependencies.
> >>
> >> However, this bug is *fairly* easy to reproduce on 3.10, just using the
> >> following script:
> >>
> >> #!/bin/bash
> >>
> >> TEST_DIR=/tmp/cgroup_test
> >> [ -d $TEST_DIR ] || mkdir -p $TEST_DIR
> >> mount -t cgroup none $TEST_DIR -o memory
> >> mkdir $TEST_DIR/test
> >> echo 512k > $TEST_DIR/test/memory.limit_in_bytes
> >
> > This is just insane. You allow only 128 pages to be charged and the
> > reclaim will have to constantly wait for each page to finish the
> > writeback.
> 
> This is a test case used ONLY to reproduce this bug; why does it have
> to be sane? :)
>
> On the other hand, no matter how insane a test case is, as long as it
> triggers hung tasks in the kernel, it is a kernel bug that needs to be
> fixed.

Well, my point was that an insane setting can produce a lot of
problems. And, as I said, this problem has been inherent since day 1.
So a real-world example would be much preferable, especially when the
code has been in this state for years and nobody has triggered the bug.

[...]
> >> So please consider this seriously. :)
> >
> > The bug has been there since the memory controller was introduced, yet
> > we have only had a single report of it happening in real life. So I do
> > not think this is that urgent. It was definitely not a good design
> > decision that the OOM killer was handled on top of unknown locks which
> > might prevent forward progress. No question about that. Do you see the
> > problem somewhere in real life? Because, to be honest, the test case is
> > pretty much insane.
> 
> I am sorry if I gave the impression that it is the above test case
> which caused this bug. No, we saw this bug in *production* in our data
> center; it happened on 30+ machines!! :) The insane test case above is
> ONLY meant to draw your attention to how serious the bug is, nothing
> else.

Sure, then the issue definitely needs to be fixed.

You wrote in another email that you have a backport. I will help you
with the review if you post it publicly.

-- 
Michal Hocko
SUSE Labs


* Re: Please backport commit 3812c8c8f39 to stable
  2014-10-07 12:33           ` Michal Hocko
@ 2014-10-09  4:56             ` Cong Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Cong Wang @ 2014-10-09  4:56 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Greg KH, LKML, stable

On Tue, Oct 7, 2014 at 5:33 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 03-10-14 11:16:31, Cong Wang wrote:
>> I am sorry if I gave the impression that it is the above test case
>> which caused this bug. No, we saw this bug in *production* in our data
>> center; it happened on 30+ machines!! :) The insane test case above is
>> ONLY meant to draw your attention to how serious the bug is, nothing
>> else.
>
> Sure, then the issue definitely needs to be fixed.
>
> You wrote in another email that you have a backport. I will help you
> with the review if you post it publicly.
>

Cool!

Note that I only have time to backport them to 3.10 and test on it.
Currently there are 8 patches backported to fix this bug; I am still
checking whether we need more.

I will post them soon and Cc you of course.

Thanks for looking at it!


end of thread

Thread overview: 12+ messages
2014-09-30  3:43 Please backport commit 3812c8c8f39 to stable Cong Wang
2014-09-30  8:16 ` Michal Hocko
2014-09-30 17:14   ` Cong Wang
2014-10-02 21:04     ` Cong Wang
2014-10-03 15:37       ` Michal Hocko
2014-10-03 18:16         ` Cong Wang
2014-10-03 20:36           ` Cong Wang
2014-10-07 12:33           ` Michal Hocko
2014-10-09  4:56             ` Cong Wang
2014-10-03 15:13     ` Michal Hocko
2014-10-03 18:03       ` Cong Wang
2014-10-07 12:23         ` Michal Hocko
