linux-kernel.vger.kernel.org archive mirror
* memory-cgroup bug
@ 2012-11-21 19:02 azurIt
  2012-11-22  0:26 ` Kamezawa Hiroyuki
  2012-11-22 15:24 ` Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: azurIt @ 2012-11-21 19:02 UTC (permalink / raw)
  To: linux-kernel

Hi,

I'm using the memory cgroup to limit our users and I'm having a really strange problem when a cgroup runs out of its memory limit. It's very strange because it happens only sometimes (about once per week, on a random user); out of memory is usually handled OK. This is what happens when the problem occurs:
 - no new processes can be started for this cgroup
 - current processes are frozen and taking 100% of CPU
 - when I try to 'strace' any of the current processes, the whole strace freezes until the process is killed (strace cannot be terminated by CTRL-C)
 - the problem can be resolved by raising the memory limit for the cgroup or by killing a few processes inside the cgroup so some memory is freed

I also grabbed the content of /proc/<pid>/stack of a frozen process:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
[<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810ec54e>] do_wp_page+0x14e/0x800
[<ffffffff810eda34>] handle_pte_fault+0x264/0x940
[<ffffffff810ee248>] handle_mm_fault+0x138/0x260
[<ffffffff810270ed>] do_page_fault+0x13d/0x460
[<ffffffff815b53ff>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
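
For what it's worth, a stack like the one above can be collected for every task in a cgroup with a small helper along these lines (a sketch, not from the original report; the tasks-file path is an assumption about your cgroup layout, and reading /proc/<pid>/stack needs root):

```shell
# dump_stacks TASKS_FILE: print /proc/<pid>/stack for each pid listed in
# TASKS_FILE (one pid per line, e.g. a cgroup's 'tasks' file).
dump_stacks() {
    while read -r pid; do
        echo "=== $pid ==="
        cat "/proc/$pid/stack" 2>/dev/null || echo "(stack not readable)"
    done < "$1"
}

# typical use against a memcg, e.g.: dump_stacks /cgroups/1234/uid/tasks
```

Running it a few times in a row shows whether the stacks are stuck or moving.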

I'm currently using kernel 3.2.34 but I've been having this problem since 2.6.32.

Any ideas? Thnx.

azurIt

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-21 19:02 memory-cgroup bug azurIt
@ 2012-11-22  0:26 ` Kamezawa Hiroyuki
  2012-11-22  9:36   ` azurIt
  2012-11-22 15:24 ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: Kamezawa Hiroyuki @ 2012-11-22  0:26 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm

(2012/11/22 4:02), azurIt wrote:
> Hi,
>
> I'm using the memory cgroup to limit our users and I'm having a really strange problem when a cgroup runs out of its memory limit. It's very strange because it happens only sometimes (about once per week, on a random user); out of memory is usually handled OK. This is what happens when the problem occurs:
>   - no new processes can be started for this cgroup
>   - current processes are frozen and taking 100% of CPU
>   - when I try to 'strace' any of the current processes, the whole strace freezes until the process is killed (strace cannot be terminated by CTRL-C)
>   - the problem can be resolved by raising the memory limit for the cgroup or by killing a few processes inside the cgroup so some memory is freed
>
> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
> [<ffffffff815b53ff>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I'm currently using kernel 3.2.34 but I've been having this problem since 2.6.32.
>
> Any ideas? Thnx.
>

Under OOM in a memcg, only one process is allowed to work, because processes tend to use
up CPU during a memory shortage; the other processes are frozen.


Then, the problem here is the one process which uses the CPU. IIUC, the 'frozen' threads
are asleep and never use CPU. It's expected that the oom-killer or memory reclaim can
solve the problem.

What is your memcg's

  memory.oom_control

value ?

and the processes' oom_adj values? (/proc/<pid>/oom_adj, /proc/<pid>/oom_score_adj)
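
The values being asked about can be gathered in one go; a sketch (the cgroup path in the example is illustrative, not from this thread; run as root):

```shell
# oom_report GROUP: show the memcg's memory.oom_control file plus the
# per-task oom_adj / oom_score_adj values for every task in GROUP.
oom_report() {
    grp="$1"
    [ -f "$grp/memory.oom_control" ] && cat "$grp/memory.oom_control"
    for pid in $(cat "$grp/tasks" 2>/dev/null); do
        printf 'pid=%s oom_adj=%s oom_score_adj=%s\n' "$pid" \
            "$(cat "/proc/$pid/oom_adj" 2>/dev/null)" \
            "$(cat "/proc/$pid/oom_score_adj" 2>/dev/null)"
    done
}

# e.g.: oom_report /cgroups/1234/uid
```

(oom_adj is the legacy knob and may be absent on newer kernels; the helper just prints it empty in that case.)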

Thanks,
-Kame






> azurIt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>




* Re: memory-cgroup bug
  2012-11-22  0:26 ` Kamezawa Hiroyuki
@ 2012-11-22  9:36   ` azurIt
  2012-11-22 21:45     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-22  9:36 UTC (permalink / raw)
  To: Kamezawa Hiroyuki; +Cc: linux-kernel, linux-mm

______________________________________________________________
> From: "Kamezawa Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
> To: azurIt <azurit@pobox.sk>
> Date: 22.11.2012 01:27
> Subject: Re: memory-cgroup bug
>
> CC: linux-kernel@vger.kernel.org, "linux-mm" <linux-mm@kvack.org>
>(2012/11/22 4:02), azurIt wrote:
>> Hi,
>>
>> I'm using the memory cgroup to limit our users and I'm having a really strange problem when a cgroup runs out of its memory limit. It's very strange because it happens only sometimes (about once per week, on a random user); out of memory is usually handled OK. This is what happens when the problem occurs:
>>   - no new processes can be started for this cgroup
>>   - current processes are frozen and taking 100% of CPU
>>   - when I try to 'strace' any of the current processes, the whole strace freezes until the process is killed (strace cannot be terminated by CTRL-C)
>>   - the problem can be resolved by raising the memory limit for the cgroup or by killing a few processes inside the cgroup so some memory is freed
>>
>> I also grabbed the content of /proc/<pid>/stack of a frozen process:
>> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
>> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
>> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
>> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
>> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
>> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
>> [<ffffffff815b53ff>] page_fault+0x1f/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
>>
>> Any ideas? Thnx.
>>
>
>Under OOM in a memcg, only one process is allowed to work, because processes tend to use
>up CPU during a memory shortage; the other processes are frozen.
>
>
>Then, the problem here is the one process which uses the CPU. IIUC, the 'frozen' threads
>are asleep and never use CPU. It's expected that the oom-killer or memory reclaim can
>solve the problem.
>
>What is your memcg's memory.oom_control value ?



oom_kill_disable 0



>and process's oom_adj values ? (/proc/<pid>/oom_adj, /proc/<pid>/oom_score_adj)


When I look at a random user PID (an Apache web server process):
oom_adj = 0
oom_score_adj = 0

I can also look at the data of a 'frozen' process if you need it, but I will have to wait until the problem occurs again.

The main problem is that when this happens, it's NOT resolved automatically by the kernel/OOM, and the user of the cgroup where it happened has non-working services until I kill his processes by hand. I'm sure that all the 'frozen' processes are taking a lot of CPU, because the server load also goes really high - next time I will take a screenshot of htop. I really wonder why the OOM killer __sometimes__ does not resolve this (it usually does, only sometimes not).


Thank you!

azur


* Re: memory-cgroup bug
  2012-11-21 19:02 memory-cgroup bug azurIt
  2012-11-22  0:26 ` Kamezawa Hiroyuki
@ 2012-11-22 15:24 ` Michal Hocko
  2012-11-22 18:05   ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-22 15:24 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Wed 21-11-12 20:02:07, azurIt wrote:
> Hi,
> 
> I'm using the memory cgroup to limit our users and I'm having a really
> strange problem when a cgroup runs out of its memory limit. It's very
> strange because it happens only sometimes (about once per week on a
> random user); out of memory is usually handled OK.

What is your memcg configuration? Do you use deeper hierarchies, is
use_hierarchy enabled? Is the memcg oom (aka memory.oom_control)
enabled? Do you use soft limit for those groups? Is memcg swap
accounting enabled and memsw limits in place?
Is the machine under global memory pressure as well?
Could you post sysrq+t or sysrq+w?

> This is what happens when the problem occurs:
>  - no new processes can be started for this cgroup
>  - current processes are frozen and taking 100% of CPU
>  - when I try to 'strace' any of the current processes, the whole strace
>    freezes until the process is killed (strace cannot be terminated by
>    CTRL-C)
>  - the problem can be resolved by raising the memory limit for the cgroup
>    or by killing a few processes inside the cgroup so some memory is freed
> 
> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0

Hmm what is this?

> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
> [<ffffffff815b53ff>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>

How many tasks are hung in mem_cgroup_handle_oom? If there were many
of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
make oom_lock 0 and 1 based rather than counter) and its follow up fix
23751be00940 (memcg: fix hierarchical oom locking) but you are saying
that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
make more sense.

> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.

I guess this is a clean vanilla (stable) kernel, right? Are you able to
reproduce with the latest Linus tree?

-- 
Michal Hocko
SUSE Labs


* Re: memory-cgroup bug
  2012-11-22 15:24 ` Michal Hocko
@ 2012-11-22 18:05   ` azurIt
  2012-11-22 21:42     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-22 18:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>> I'm using the memory cgroup to limit our users and I'm having a really
>> strange problem when a cgroup runs out of its memory limit. It's very
>> strange because it happens only sometimes (about once per week on a
>> random user); out of memory is usually handled OK.
>
>What is your memcg configuration? Do you use deeper hierarchies, is
>use_hierarchy enabled? Is the memcg oom (aka memory.oom_control)
>enabled? Do you use soft limit for those groups? Is memcg swap
>accounting enabled and memsw limits in place?
>Is the machine under global memory pressure as well?
>Could you post sysrq+t or sysrq+w?


My cgroups hierarchy:
/cgroups/<user_id>/uid/

where '<user_id>' is system user id and 'uid' is just word 'uid'.

Memory limits are set in /cgroups/<user_id>/ and use_hierarchy is enabled. Processes are inside /cgroups/<user_id>/uid/. I'm using hard limits for memory and swap, BUT the system has no swap at all (it has 'only' 16 GB of real RAM). memory.oom_control is set to 'oom_kill_disable 0'. The server has enough free memory when the problem occurs.
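
For context, the setup described above boils down to something like this (a reconstruction, not the poster's actual script; uid 1234, the limit value, and the mount point are illustrative):

```shell
# setup_usergrp ROOT UID LIMIT: reconstruct the described memcg layout -
# hard mem and mem+swap limits on ROOT/UID, tasks meant to live in
# ROOT/UID/uid, with use_hierarchy enabled on the root.
setup_usergrp() {
    root="$1"; uid="$2"; limit="$3"
    if [ -f "$root/memory.use_hierarchy" ]; then
        echo 1 > "$root/memory.use_hierarchy"
        mkdir -p "$root/$uid/uid"
        echo "$limit" > "$root/$uid/memory.limit_in_bytes"
        # same memsw value: no swap headroom beyond the memory limit
        echo "$limit" > "$root/$uid/memory.memsw.limit_in_bytes"
        echo "configured $root/$uid"
    else
        echo "no cgroup v1 memory controller mounted at $root"
    fi
}

# e.g.: setup_usergrp /cgroups 1234 157286400
```

With no swap and memsw equal to the memory limit, hitting the limit leaves reclaim nothing to swap out, so the memcg OOM path is the only way forward.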




> >> This is what happens when the problem occurs:
> >>  - no new processes can be started for this cgroup
> >>  - current processes are frozen and taking 100% of CPU
> >>  - when I try to 'strace' any of the current processes, the whole strace
> >>    freezes until the process is killed (strace cannot be terminated by
> >>    CTRL-C)
> >>  - the problem can be resolved by raising the memory limit for the cgroup
> >>    or by killing a few processes inside the cgroup so some memory is freed
>> 
>> I also grabbed the content of /proc/<pid>/stack of a frozen process:
>> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>
>Hmm what is this?


I really don't know; I will get the stacks of all frozen processes next time so we can compare them.



>> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
>> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
>> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
>> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
>> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
>> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
>> [<ffffffff815b53ff>] page_fault+0x1f/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>
>How many tasks are hung in mem_cgroup_handle_oom? If there were many
>of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
>make oom_lock 0 and 1 based rather than counter) and its follow up fix
>23751be00940 (memcg: fix hierarchical oom locking) but you are saying
>that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
>make more sense.


Usually a maximum of several tens of processes, but I will check next time. I was having much worse problems on 2.6.32 - when the freezing happened, the whole server was affected (I wasn't able to do anything and had to wait until my scripts took care of it and killed Apache, so I don't have any detailed info). On 3.2 only the target cgroup is affected.




>> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
>
>I guess this is a clean vanilla (stable) kernel, right? Are you able to
>reproduce with the latest Linus tree?


Well, no. I'm using, for example, the newest stable grsecurity patch. I'm also using a few of Andrea Righi's cgroup subsystems, but I don't believe these are causing the problems:
 - cgroup-uid, which moves processes into cgroups based on UID
 - cgroup-task, which can limit the number of tasks in a cgroup (I already tried disabling this one; it didn't help)
http://www.develer.com/~arighi/linux/patches/

Unfortunately I cannot just install a new and untested kernel version because I'm not able to reproduce this problem on demand (it happens randomly in a production environment).

Could it be that the OOM killer cannot start and kill processes because there's no free memory in the cgroup?



Thank you!

azur


* Re: memory-cgroup bug
  2012-11-22 18:05   ` azurIt
@ 2012-11-22 21:42     ` Michal Hocko
  2012-11-22 22:34       ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-22 21:42 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Thu 22-11-12 19:05:26, azurIt wrote:
[...]
> My cgroups hierarchy:
> /cgroups/<user_id>/uid/
> 
> where '<user_id>' is system user id and 'uid' is just word 'uid'.
> 
> Memory limits are set in /cgroups/<user_id>/ and use_hierarchy is
> enabled. Processes are inside /cgroups/<user_id>/uid/. I'm using
> hard limits for memory and swap, BUT the system has no swap at all
> (it has 'only' 16 GB of real RAM). memory.oom_control is set to
> 'oom_kill_disable 0'. The server has enough free memory when the
> problem occurs.

OK, so the global reclaim shouldn't be active. This is definitely
good to know.
 
> >> This is what happens when the problem occurs:
> >>  - no new processes can be started for this cgroup
> >>  - current processes are frozen and taking 100% of CPU
> >>  - when I try to 'strace' any of the current processes, the whole strace
> >>    freezes until the process is killed (strace cannot be terminated by
> >>    CTRL-C)
> >>  - the problem can be resolved by raising the memory limit for the cgroup
> >>    or by killing a few processes inside the cgroup so some memory is freed
> >> 
> >> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> >> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >
> >Hmm what is this?
> 
> I really don't know; I will get the stacks of all frozen processes next
> time so we can compare them.
> 
> >> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> >> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> >> [<ffffffff810ec54e>] do_wp_page+0x14e/0x800
> >> [<ffffffff810eda34>] handle_pte_fault+0x264/0x940
> >> [<ffffffff810ee248>] handle_mm_fault+0x138/0x260
> >> [<ffffffff810270ed>] do_page_fault+0x13d/0x460
> >> [<ffffffff815b53ff>] page_fault+0x1f/0x30
> >> [<ffffffffffffffff>] 0xffffffffffffffff

Btw. is this stack stable or is the task bouncing in some loop?
And finally could you post the disassembly of your version of
mem_cgroup_handle_oom, please?

> >How many tasks are hung in mem_cgroup_handle_oom? If there were many
> >of them then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
> >make oom_lock 0 and 1 based rather than counter) and its follow up fix
> >23751be00940 (memcg: fix hierarchical oom locking) but you are saying
> >that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
> >make more sense.
> 
> 
> Usually a maximum of several tens of processes, but I will check next
> time. I was having much worse problems on 2.6.32 - when the freezing
> happened, the whole server was affected (I wasn't able to do anything
> and had to wait until my scripts took care of it and killed Apache,
> so I don't have any detailed info).

Hmm, maybe the issue fixed by 1d65f86d (mm: preallocate page before
lock_page() at filemap COW) which was merged in 3.1.

> In 3.2 only target cgroup is affected.
> 
> >> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
> >
> >I guess this is a clean vanilla (stable) kernel, right? Are you able to
> >reproduce with the latest Linus tree?
> 
> 
> Well, no. I'm using, for example, newest stable grsecurity patch.

That shouldn't be related.

> I'm also using a few of Andrea Righi's cgroup subsystems but I don't
> believe these are causing the problems:
>  - cgroup-uid, which moves processes into cgroups based on UID
>  - cgroup-task, which can limit the number of tasks in a cgroup (I
>    already tried disabling this one; it didn't help)
> http://www.develer.com/~arighi/linux/patches/

I am not familiar with those patches but I will double check.

> Unfortunately I cannot just install a new and untested kernel version
> because I'm not able to reproduce this problem on demand (it happens
> randomly in a production environment).

This will make it a bit harder to debug, but let's see, maybe the new
traces will help...
 
> Could it be that the OOM killer cannot start and kill processes because
> there's no free memory in the cgroup?

That shouldn't happen. 

-- 
Michal Hocko
SUSE Labs


* Re: memory-cgroup bug
  2012-11-22  9:36   ` azurIt
@ 2012-11-22 21:45     ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-22 21:45 UTC (permalink / raw)
  To: azurIt; +Cc: Kamezawa Hiroyuki, linux-kernel, linux-mm

On Thu 22-11-12 10:36:18, azurIt wrote:
[...]
> I can also look at the data of a 'frozen' process if you need it, but I
> will have to wait until the problem occurs again.
> 
> The main problem is that when this happens, it's NOT resolved
> automatically by the kernel/OOM, and the user of the cgroup where it
> happened has non-working services until I kill his processes by hand.
> I'm sure that all the 'frozen' processes are taking a lot of CPU,
> because the server load also goes really high - next time I will take a
> screenshot of htop. I really wonder why the OOM killer __sometimes__
> does not resolve this (it usually does, only sometimes not).

What does your kernel log say while this is happening? Are there any
memcg OOM messages showing up?
-- 
Michal Hocko
SUSE Labs


* Re: memory-cgroup bug
  2012-11-22 21:42     ` Michal Hocko
@ 2012-11-22 22:34       ` azurIt
  2012-11-23  7:40         ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-22 22:34 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Btw. is this stack stable or is the task bouncing in some loop?


Not sure, will check it next time.



>And finally could you post the disassembly of your version of
>mem_cgroup_handle_oom, please?


How can i do this?



>What does your kernel log say while this is happening? Are there any
>memcg OOM messages showing up?


I will get the logs next time.


Thank you!

azur


* Re: memory-cgroup bug
  2012-11-22 22:34       ` azurIt
@ 2012-11-23  7:40         ` Michal Hocko
  2012-11-23  9:21           ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-23  7:40 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Thu 22-11-12 23:34:34, azurIt wrote:
[...]
> >And finally could you post the disassembly of your version of
> >mem_cgroup_handle_oom, please?
> 
> How can i do this?

Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom, or
use objdump -d YOUR_VMLINUX and copy out only the mem_cgroup_handle_oom
function.
-- 
Michal Hocko
SUSE Labs


* Re: memory-cgroup bug
  2012-11-23  7:40         ` Michal Hocko
@ 2012-11-23  9:21           ` azurIt
  2012-11-23  9:28             ` Michal Hocko
                               ` (2 more replies)
  0 siblings, 3 replies; 172+ messages in thread
From: azurIt @ 2012-11-23  9:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>function.
If 'YOUR_VMLINUX' is supposed to be my kernel image:

# gdb vmlinuz-3.2.34-grsec-1 
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized


# objdump -d vmlinuz-3.2.34-grsec-1 
objdump: vmlinuz-3.2.34-grsec-1: File format not recognized


# file vmlinuz-3.2.34-grsec-1 
vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA

I'm probably doing something wrong :)



Luckily, it happened again, so I have more info.

 - there weren't any OOM log messages from the kernel for that cgroup
 - there were 16 processes in the cgroup
 - processes in the cgroup were together taking 100% of CPU (it was allowed to use only one core, so 100% of that core)
 - memory.failcnt was growing fast
 - oom_control:
oom_kill_disable 0
under_oom 0 (this was flipping between 0 and 1)
 - limit_in_bytes was set to 157286400
 - content of stat (as you can see, the whole memory limit was used):
cache 0
rss 0
mapped_file 0
pgpgin 0
pgpgout 0
swap 0
pgfault 0
pgmajfault 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 157286400
hierarchical_memsw_limit 157286400
total_cache 0
total_rss 157286400
total_mapped_file 0
total_pgpgin 10326454
total_pgpgout 10288054
total_swap 0
total_pgfault 12939677
total_pgmajfault 4283
total_inactive_anon 0
total_active_anon 157286400
total_inactive_file 0
total_active_file 0
total_unevictable 0


I also grabbed oom_adj, oom_score_adj and the stacks of all processes; here they are:
http://www.watchdog.sk/lkml/memcg-bug.tar

Notice that the stack is different for a few processes. The stacks of all processes were NOT changing and stayed the same.

Btw, I don't know if it matters, but I have several cgroup subsystems mounted and I'm also using them (I was not activating the freezer in this case; I don't know whether it can be activated automatically by the kernel or whatnot, and I didn't check whether the cgroup was frozen, but I suppose it wasn't):
none            /cgroups        cgroup  defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Thank you.

azur


* Re: memory-cgroup bug
  2012-11-23  9:21           ` azurIt
@ 2012-11-23  9:28             ` Michal Hocko
  2012-11-23  9:44               ` azurIt
  2012-11-23  9:34             ` Glauber Costa
  2012-11-23 10:04             ` Michal Hocko
  2 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-23  9:28 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
> >function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
> 
> # gdb vmlinuz-3.2.34-grsec-1 
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
> 
> 
> # objdump -d vmlinuz-3.2.34-grsec-1 

You need vmlinux, not vmlinuz...
-- 
Michal Hocko
SUSE Labs


* Re: memory-cgroup bug
  2012-11-23  9:21           ` azurIt
  2012-11-23  9:28             ` Michal Hocko
@ 2012-11-23  9:34             ` Glauber Costa
  2012-11-23 10:04             ` Michal Hocko
  2 siblings, 0 replies; 172+ messages in thread
From: Glauber Costa @ 2012-11-23  9:34 UTC (permalink / raw)
  To: azurIt; +Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist

On 11/23/2012 01:21 PM, azurIt wrote:
>> Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
> 
> # gdb vmlinuz-3.2.34-grsec-1 

This is vmlinuz, not vmlinux; vmlinuz is the compressed image.

> 
> # file vmlinuz-3.2.34-grsec-1 
> vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA
> 
> I'm probably doing something wrong :)

You need this:

[glauber@straightjacket linux-glommer]$ file vmlinux
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically
linked, BuildID[sha1]=0xba936ee6b6096f9bc4c663f2a2ee0c2d2481c408, not
stripped

instead of bzImage.



* Re: memory-cgroup bug
  2012-11-23  9:28             ` Michal Hocko
@ 2012-11-23  9:44               ` azurIt
  2012-11-23 10:10                 ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-23  9:44 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>
>On Fri 23-11-12 10:21:37, azurIt wrote:
>> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> >function.
>> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>> 
>> # gdb vmlinuz-3.2.34-grsec-1 
>> GNU gdb (GDB) 7.0.1-debian
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>...
>> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>> 
>> 
>> # objdump -d vmlinuz-3.2.34-grsec-1 
>
>You need vmlinux not vmlinuz...




OK, got it, but still no luck:

# gdb vmlinux 
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
(gdb) disassemble mem_cgroup_handle_oom
No symbol table is loaded.  Use the "file" command.



# objdump -d vmlinux | grep mem_cgroup_handle_oom
<no output>


I can recompile the kernel if anything needs to be added to it.


azur


* Re: memory-cgroup bug
  2012-11-23  9:21           ` azurIt
  2012-11-23  9:28             ` Michal Hocko
  2012-11-23  9:34             ` Glauber Costa
@ 2012-11-23 10:04             ` Michal Hocko
  2012-11-23 14:59               ` azurIt
  2012-11-25  0:10               ` azurIt
  2 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-23 10:04 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
[...]
> Luckily, it happened again, so I have more info.
> 
>  - there weren't any OOM log messages from the kernel for that cgroup
>  - there were 16 processes in the cgroup
>  - processes in the cgroup were together taking 100% of CPU (it
>    was allowed to use only one core, so 100% of that core)
>  - memory.failcnt was growing fast
>  - oom_control:
> oom_kill_disable 0
> under_oom 0 (this was flipping between 0 and 1)

So there was an OOM going on but no messages in the log? Really strange.
Kame already asked about the oom_score_adj of the processes in the group,
but it didn't look like all of them had the OOM killer disabled, right?

>  - limit_in_bytes was set to 157286400
>  - content of stat (as you can see, the whole memory limit was used):
> cache 0
> rss 0

This looks like a top-level group for your user.

> mapped_file 0
> pgpgin 0
> pgpgout 0
> swap 0
> pgfault 0
> pgmajfault 0
> inactive_anon 0
> active_anon 0
> inactive_file 0
> active_file 0
> unevictable 0
> hierarchical_memory_limit 157286400
> hierarchical_memsw_limit 157286400
> total_cache 0
> total_rss 157286400

OK, so all the memory is anonymous and you have no swap, so the OOM
killer is the only thing left to do.

> total_mapped_file 0
> total_pgpgin 10326454
> total_pgpgout 10288054
> total_swap 0
> total_pgfault 12939677
> total_pgmajfault 4283
> total_inactive_anon 0
> total_active_anon 157286400
> total_inactive_file 0
> total_active_file 0
> total_unevictable 0
> 
> 
> I also grabbed oom_adj, oom_score_adj and the stacks of all processes;
> here they are:
> http://www.watchdog.sk/lkml/memcg-bug.tar

Hmm, all processes waiting for oom are stuck at the very same place:
$ grep mem_cgroup_handle_oom -r [0-9]*
30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0

We are taking memcg_oom_lock spinlock twice in that function + we can
schedule. As none of the tasks is scheduled this would suggest that you
are blocked at the first lock. But who got the lock then?
This is really strange.
Btw. is sysrq+t resp. sysrq+w showing the same traces as
/proc/<pid>/stack?
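For reference, sysrq does not require a physical console; it can be triggered from any root shell (e.g. over SSH) via /proc, assuming the sysrq sysctl allows it:

```shell
# Dump task stacks to the kernel log via sysrq (needs root; the sysctl
# value is a mask controlling which sysrq commands are allowed).
cat /proc/sys/kernel/sysrq                         # non-zero => sysrq enabled
echo t > /proc/sysrq-trigger 2>/dev/null || true   # 't': stacks of all tasks
echo w > /proc/sysrq-trigger 2>/dev/null || true   # 'w': blocked tasks only
dmesg | tail -n 50 || true                         # the traces land in dmesg
```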
 
> Notice that stack is different for few processes.

Yes, the others are in VFS resp. ext3. ext3_write_begin looks a bit dangerous
but it grabs the page before it really starts a transaction.

> Stacks for all processes were NOT changing and were still the same.

Could you take a few snapshots over time?
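A loop along these lines could take such snapshots; the cgroup path, output directory, and helper name are assumptions, adjust them to your layout:

```shell
# One snapshot per second of a cgroup's task list, failcnt, and each
# task's kernel stack.
# Usage (paths are made up): snapshot_cgroup /cgroups/1258 /root/memcg-bug 600
snapshot_cgroup() {
    cgroup=$1; out=$2; n=${3:-600}
    i=1
    while [ "$i" -le "$n" ]; do
        d="$out/$i"
        mkdir -p "$d"
        cp "$cgroup/tasks" "$d/tasks"
        cp "$cgroup/memory.failcnt" "$d/" 2>/dev/null || true
        for pid in $(cat "$d/tasks"); do
            mkdir -p "$d/$pid"
            # /proc/<pid>/stack needs root; ignore tasks that already exited
            cat "/proc/$pid/stack" > "$d/$pid/stack" 2>/dev/null || true
        done
        [ "$i" -lt "$n" ] && sleep 1
        i=$((i + 1))
    done
}
```

The archive azurIt posted later in the thread (memcg-bug-2.tar.gz) has exactly this layout: one numbered directory per second, each with per-pid stack files and a memory.failcnt copy.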

> Btw, don't know if it matters but i have several cgroup subsystems
> mounted and i'm also using them (i was not activating freezer in this
> case, don't know if it can be activated automatically by the kernel or what,

No

> didn't checked if cgroup was freezed but i suppose it wasn't):
> none            /cgroups        cgroup  defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Do you see the same issue if only the memory controller is mounted (resp.
cpuset, which you seem to use as well from your description)?

I know you said booting into a vanilla kernel would be problematic but
could you at least rule out the cgroup patches that you have mentioned?
If you need to move a task to a group based on an uid you can use the
cgrules daemon (libcgroup1 package) for that as well.
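For illustration, the rules for cgrulesengd live in /etc/cgrules.conf; an entry mapping users to per-uid groups might look like this (the user/group names and destinations are made up, see cgrules.conf(5)):

```
# /etc/cgrules.conf -- consumed by cgrulesengd
# <user>       <controllers>     <destination>
azur           memory,cpuset     users/azur/
@hosting       memory,cpuset     users/%u/
```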
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-23  9:44               ` azurIt
@ 2012-11-23 10:10                 ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-23 10:10 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:44:23, azurIt wrote:
[...]
> # gdb vmlinux 
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
> (gdb) disassemble mem_cgroup_handle_oom
> No symbol table is loaded.  Use the "file" command.
> 
> 
> 
> # objdump -d vmlinux | grep mem_cgroup_handle_oom
> <no output>

Hmm, strange: the function is on the stack but it has been inlined?
Doesn't make much sense to me.

> i can recompile the kernel if anything needs to be added into it.

If you could instrument mem_cgroup_handle_oom with some printks (before
we take the memcg_oom_lock, before we schedule and into
mem_cgroup_out_of_memory)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-23 10:04             ` Michal Hocko
@ 2012-11-23 14:59               ` azurIt
  2012-11-25 10:17                 ` Michal Hocko
  2012-11-25  0:10               ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-23 14:59 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>If you could instrument mem_cgroup_handle_oom with some printks (before
>we take the memcg_oom_lock, before we schedule and into
>mem_cgroup_out_of_memory)


If you send me a patch i can do it. I'm, unfortunately, not able to code it myself.



>> It, luckily, happend again so i have more info.
>> 
>>  - there wasn't any logs in kernel from OOM for that cgroup
>>  - there were 16 processes in cgroup
>>  - processes in cgroup were taking together 100% of CPU (it
>>    was allowed to use only one core, so 100% of that core)
>>  - memory.failcnt was growing fast
>>  - oom_control:
>> oom_kill_disable 0
>> under_oom 0 (this was looping from 0 to 1)
>
>So there was an OOM going on but no messages in the log? Really strange.
>Kame already asked about oom_score_adj of the processes in the group but
>it didn't look like all the processes would have oom disabled, right?


There were no messages telling that some processes were killed because of OOM.


>>  - limit_in_bytes was set to 157286400
>>  - content of stat (as you can see, the whole memory limit was used):
>> cache 0
>> rss 0
>
>This looks like a top-level group for your user.


Yes, it was from /cgroup/<user-id>/


>> mapped_file 0
>> pgpgin 0
>> pgpgout 0
>> swap 0
>> pgfault 0
>> pgmajfault 0
>> inactive_anon 0
>> active_anon 0
>> inactive_file 0
>> active_file 0
>> unevictable 0
>> hierarchical_memory_limit 157286400
>> hierarchical_memsw_limit 157286400
>> total_cache 0
>> total_rss 157286400
>
>OK, so all the memory is anonymous and you have no swap so the oom is
>the only thing to do.


What will happen if the same situation occurs globally? No swap, every bit of memory used. Will the kernel be able to start the OOM killer? Maybe the same thing is happening in the cgroup - there's simply no memory left to run the OOM killer. And maybe that is why it happens so rarely - usually there are still at least a few KBs of memory left to start the OOM killer.


>Hmm, all processes waiting for oom are stuck at the very same place:
>$ grep mem_cgroup_handle_oom -r [0-9]*
>30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>
>We are taking memcg_oom_lock spinlock twice in that function + we can
>schedule. As none of the tasks is scheduled this would suggest that you
>are blocked at the first lock. But who got the lock then?
>This is really strange.
>Btw. is sysrq+t resp. sysrq+w showing the same traces as
>/proc/<pid>/stat?


Unfortunately i'm connecting remotely to the servers (SSH).


>> Notice that stack is different for few processes.
>
>Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous
>but it grabs the page before it really starts a transaction.


Maybe these processes were throttled by cgroup-blkio at the same time and are still keeping the lock? So the problem occurs when they are low on memory and the cgroup is doing IO beyond its limits. Only guessing and thinking out loud.


>> Stacks for all processes were NOT changing and were still the same.
>
>Could you take few snapshots over time?


Will do next time, but i can't keep services frozen for a long time or customers will be angry.


>> didn't checked if cgroup was freezed but i suppose it wasn't):
>> none            /cgroups        cgroup  defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
>
>Do you see the same issue if only memory controller was mounted (resp.
>cpuset which you seem to use as well from your description).


Uh, we are using all mounted subsystems :( I will be able to umount only freezer and maybe blkio for some time. Will it help?


>I know you said booting into a vanilla kernel would be problematic but
>could you at least rule out te cgroup patches that you have mentioned?
>If you need to move a task to a group based by an uid you can use
>cgrules daemon (libcgroup1 package) for that as well.


We are using cgroup-uid because it's MUCH MUCH MUCH more effective and better. For example, i don't believe that cgroup-task will work with that daemon. What will happen if cgrules isn't able to add a process into a cgroup because of the task limit? The process will probably continue and run outside of any cgroup, which is wrong. With cgroup-task + cgroup-uid, such processes cannot even be started (and this is what we need).

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-23 10:04             ` Michal Hocko
  2012-11-23 14:59               ` azurIt
@ 2012-11-25  0:10               ` azurIt
  2012-11-25 12:05                 ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-25  0:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Could you take few snapshots over time?


Here it is, now from a different server; a snapshot was taken every second
for 10 minutes (hope it's enough):
www.watchdog.sk/lkml/memcg-bug-2.tar.gz

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-23 14:59               ` azurIt
@ 2012-11-25 10:17                 ` Michal Hocko
  2012-11-25 12:39                   ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-25 10:17 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 15:59:04, azurIt wrote:
> >If you could instrument mem_cgroup_handle_oom with some printks (before
> >we take the memcg_oom_lock, before we schedule and into
> >mem_cgroup_out_of_memory)
> 
> 
> If you send me patch i can do it. I'm, unfortunately, not able to code it.

Inlined at the end of the email. Please note I have only compile tested
it. It might produce a lot of output.
 
> >> It, luckily, happend again so i have more info.
> >> 
> >>  - there wasn't any logs in kernel from OOM for that cgroup
> >>  - there were 16 processes in cgroup
> >>  - processes in cgroup were taking together 100% of CPU (it
> >>    was allowed to use only one core, so 100% of that core)
> >>  - memory.failcnt was growing fast
> >>  - oom_control:
> >> oom_kill_disable 0
> >> under_oom 0 (this was looping from 0 to 1)
> >
> >So there was an OOM going on but no messages in the log? Really strange.
> >Kame already asked about oom_score_adj of the processes in the group but
> >it didn't look like all the processes would have oom disabled, right?
> 
> 
> There were no messages telling that some processes were killed because of OOM.

dmesg | grep "Out of memory"
doesn't tell anything, right?

> >>  - limit_in_bytes was set to 157286400
> >>  - content of stat (as you can see, the whole memory limit was used):
> >> cache 0
> >> rss 0
> >
> >This looks like a top-level group for your user.
> 
> 
> Yes, it was from /cgroup/<user-id>/
> 
> 
> >> mapped_file 0
> >> pgpgin 0
> >> pgpgout 0
> >> swap 0
> >> pgfault 0
> >> pgmajfault 0
> >> inactive_anon 0
> >> active_anon 0
> >> inactive_file 0
> >> active_file 0
> >> unevictable 0
> >> hierarchical_memory_limit 157286400
> >> hierarchical_memsw_limit 157286400
> >> total_cache 0
> >> total_rss 157286400
> >
> >OK, so all the memory is anonymous and you have no swap so the oom is
> >the only thing to do.
> 
> 
> What will happen if the same situation occurs globally? No swap, every
> bit of memory used. Will kernel be able to start OOM killer?

OOM killer is not a task. It doesn't allocate any memory. It just walks
the process list and picks the task with the highest score. If the
global oom is not able to find any such task (e.g. because all of them
have oom disabled) then the system panics.

> Maybe the same thing is happening in cgroup

cgroup oom differs only in the aspect that the system doesn't panic if
there is no suitable task to kill.

[...]
> >> Notice that stack is different for few processes.
> >
> >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous
> >but it grabs the page before it really starts a transaction.
> 
> 
> Maybe these processes were throttled by cgroup-blkio at the same time
> and are still keeping the lock?

If you are thinking about memcg_oom_lock then this is not possible
because the lock is held only for short times. There is no other lock
that memcg oom holds.

> So the problem occurs when they are low on memory and the cgroup is doing
> IO beyond its limits. Only guessing and thinking out loud.

The lockup (if this is what happens) still might be related to the IO
controller if the killed task cannot finish due to pending IO, though.
 
[...]
> >> didn't checked if cgroup was freezed but i suppose it wasn't):
> >> none            /cgroups        cgroup  defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
> >
> >Do you see the same issue if only memory controller was mounted (resp.
> >cpuset which you seem to use as well from your description).
> 
> 
> Uh, we are using all mounted subsystems :( I will be able to umount
> only freezer and maybe blkio for some time. Will it help?

Not sure about that without further data.

> >I know you said booting into a vanilla kernel would be problematic but
> >could you at least rule out the cgroup patches that you have mentioned?
> >If you need to move a task to a group based on an uid you can use
> >cgrules daemon (libcgroup1 package) for that as well.
> 
> 
> We are using cgroup-uid cos it's MUCH MUCH MUCH more efective and
> better. For example, i don't believe that cgroup-task will work with
> that daemon. What will happen if cgrules won't be able to add process
> into cgroup because of task limit? Process will probably continue and
> will run outside of any cgroup which is wrong. With cgroup-task +
> cgroup-uid, such processes cannot be even started (and this is what we
> need).

I am not familiar with cgroup-task controller so I cannot comment on
that.

---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..7f26ec8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1863,6 +1863,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
 	struct oom_wait_info owait;
 	bool locked, need_to_kill;
+	int ret = false;
 
 	owait.mem = memcg;
 	owait.wait.flags = 0;
@@ -1873,6 +1874,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_mark_under_oom(memcg);
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
+	printk("XXX: %d waiting for memcg_oom_lock\n", current->pid);
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
 	/*
@@ -1887,12 +1889,14 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 		mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
+	printk("XXX: %d need_to_kill:%d locked:%d\n", current->pid, need_to_kill, locked);
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask);
 	} else {
 		schedule();
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+		printk("XXX: %d woken up\n", current->pid);
 	}
 	spin_lock(&memcg_oom_lock);
 	if (locked)
@@ -1903,10 +1907,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_unmark_under_oom(memcg);
 
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
-		return false;
+		goto out;
 	/* Give chance to dying process */
 	schedule_timeout_uninterruptible(1);
-	return true;
+	ret = true;
+out:
+	printk("XXX: %d done with %d\n", current->pid, ret);
+	return ret;
 }
 
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..a7db813 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -568,6 +568,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 */
 	if (fatal_signal_pending(current)) {
 		set_thread_flag(TIF_MEMDIE);
+		printk("XXX: %d skipping task with fatal signal pending\n", current->pid);
 		return;
 	}
 
@@ -576,8 +577,10 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, limit, mem, NULL);
-	if (!p || PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL) {
+		printk("XXX: %d nothing to kill\n", current->pid);
 		goto out;
+	}
 
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
 				"Memory cgroup out of memory"))

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25  0:10               ` azurIt
@ 2012-11-25 12:05                 ` Michal Hocko
  2012-11-25 12:36                   ` azurIt
  2012-11-25 13:55                   ` Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-25 12:05 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

[Adding Kamezawa into CC]

On Sun 25-11-12 01:10:47, azurIt wrote:
> >Could you take few snapshots over time?
> 
> 
> Here it is, now from different server, snapshot was taken every second
> for 10 minutes (hope it's enough):
> www.watchdog.sk/lkml/memcg-bug-2.tar.gz

Hmm, interesting:
$ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
min:16281 max:224048 avg:18818.943119

So there is a lot of attempts to allocate which fail, every second!
Will get to that later.
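The one-liner above, unrolled into a small reusable helper (it expects one memory.failcnt value per line, oldest snapshot first; unlike the one-liner, it seeds min/max from the first diff instead of a magic constant):

```shell
# min/max/avg of the per-snapshot memory.failcnt increments.
failcnt_stats() {
    awk '
    NR > 1 {
        d = $1 - prev
        if (n == 0 || d > max) max = d
        if (n == 0 || d < min) min = d
        sum += d; n++
    }
    { prev = $1 }
    END {
        if (n) printf "min:%d max:%d avg:%f\n", min, max, sum / n
        else   print "need at least two samples"
    }' "$@"
}
# e.g.: grep . */memory.failcnt | cut -d: -f2 | failcnt_stats
```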

The number of tasks in the group is stable (20):
$ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
    546 20

And no task has been killed or spawned:
$ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
24495
24762
24774
24796
24798
24805
24813
24827
24831
24841
24842
24863
24892
24924
24931
25130
25131
25192
25193
25243

$ for stack in [0-9]*/[0-9]*
do 
	head -n1 $stack/stack
done | sort | uniq -c
   9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
    546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
    533 [<ffffffffffffffff>] 0xffffffffffffffff

Tells us that the stacks are pretty much stable.
$ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
    546 24495

So 24495 is stuck in do_truncate
[<ffffffff811109b8>] do_truncate+0x58/0xa0
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

I suspect it is waiting for i_mutex. Who is holding that lock?
Other tasks are blocked on the mem_cgroup_handle_oom either coming from
the page fault path so i_mutex can be exluded or vfs_write (24796) and
that one is interesting:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0		# takes &inode->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This smells like a deadlock. But a strange one. The rapidly
increasing failcnt suggests that somebody still tries to allocate, but
who, when all of them hang in mem_cgroup_handle_oom? This can be
explained though.
The memcg OOM killer lets only one process (the one which is able to lock
the hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and
kill a process, while the others are waiting on the wait queue. Once the
killer is done it calls memcg_wakeup_oom which wakes up the other tasks
waiting on the queue. Those retry the charge, in the hope that some memory
was freed in the meantime. That hasn't happened, so they get into OOM
again (and again and again).
This all usually works out except in this particular case I would bet
my hat that the OOM selected task is pid 24495 which is blocked on the
mutex which is held by one of the oom killer task so it cannot finish -
thus free a memory.
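The suspected cycle can be sketched in userspace. This is an analogy, not kernel code: flock stands in for i_mutex, a flag file stands in for "the OOM victim exited and freed memory", and the function name is made up:

```shell
# Toy model of the deadlock: the "writer" holds the lock (i_mutex) while
# waiting for memory to be freed; the OOM-selected "victim" needs the
# same lock before it can exit and free any memory.
simulate_deadlock() {
    lock=$(mktemp); flag="$lock.flag"
    (   # writer: generic_file_aio_write takes i_mutex, then the charge
        # loops in mem_cgroup_handle_oom waiting for the victim to die
        flock 9
        while [ ! -e "$flag" ]; do sleep 0.1; done
    ) 9>"$lock" &
    writer=$!
    sleep 0.3                        # let the writer grab the lock first
    # victim: do_truncate needs i_mutex before it can exit (free memory)
    if ! flock -w 2 "$lock" -c "touch '$flag'"; then
        echo "victim stuck on i_mutex: deadlock"
        touch "$flag"                # break the cycle so the demo ends
    fi
    wait "$writer"
    rm -f "$lock" "$flag"
}
simulate_deadlock
```

The victim's 2-second timeout always fires because the writer never releases the lock until the flag appears, and the flag can only appear once the victim has the lock: the same circular wait as above, just made visible.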

It seems that the current Linus' tree is affected as well.

I will have to think about a solution but it sounds really tricky. It is
not just ext3 that is affected.

I guess we need to tell mem_cgroup_cache_charge that it should never
reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
me. On the other hand it is really weird that an excessive writer might
trigger a memcg OOM killer.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25 12:05                 ` Michal Hocko
@ 2012-11-25 12:36                   ` azurIt
  2012-11-25 13:55                   ` Michal Hocko
  1 sibling, 0 replies; 172+ messages in thread
From: azurIt @ 2012-11-25 12:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

>So there is a lot of attempts to allocate which fail, every second!


Yes, as i said, the cgroup was taking 100% of its (allocated) CPU core(s). Not sure if all processes were using CPU but _a few_ of them (not only one) for sure.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25 10:17                 ` Michal Hocko
@ 2012-11-25 12:39                   ` azurIt
  2012-11-25 13:02                     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-25 12:39 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Inlined at the end of the email. Please note I have compile tested
>it. It might produce a lot of output.


Thank you very much, i will install it ASAP (probably tonight).


>dmesg | grep "Out of memory"
>doesn't tell anything, right?


Only messages for other cgroups but not for the frozen one (neither before nor after the freeze).


azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25 12:39                   ` azurIt
@ 2012-11-25 13:02                     ` Michal Hocko
  2012-11-25 13:27                       ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-25 13:02 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Sun 25-11-12 13:39:53, azurIt wrote:
> >Inlined at the end of the email. Please note I have compile tested
> >it. It might produce a lot of output.
> 
> 
> Thank you very much, i will install it ASAP (probably this night).

Please don't. If my analysis is correct, and I am almost 100% sure it
is, then it would cause excessive logging. I am sorry I cannot come up
with something else in the meantime.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25 13:02                     ` Michal Hocko
@ 2012-11-25 13:27                       ` azurIt
  2012-11-25 13:44                         ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-25 13:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>> Thank you very much, i will install it ASAP (probably this night).
>
>Please don't. If my analysis is correct which I am almost 100% sure it
>is then it would cause excessive logging. I am sorry I cannot come up
>with something else in the mean time.


Ok then. I will, meanwhile, try to contact Andrea Righi (author of cgroup-task etc.) and ask him to send his opinion about the relation between the freezes and his patches. Maybe it's some kind of a bug in memcg which doesn't appear in the current vanilla code and is triggered by conditions created by, for example, cgroup-task.

I noticed that there is always exactly the same number of frozen processes as the limit set on the number of tasks by cgroup-task (i already tried to raise this limit AFTER the cgroup froze, it didn't change anything). I'm sure it's not a problem with cgroup-task alone, it's 100% related also to memcg (but maybe the combination of both is needed).

Thank you so far for your time!

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25 13:27                       ` azurIt
@ 2012-11-25 13:44                         ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-25 13:44 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Sun 25-11-12 14:27:09, azurIt wrote:
> >> Thank you very much, i will install it ASAP (probably this night).
> >
> >Please don't. If my analysis is correct which I am almost 100% sure it
> >is then it would cause excessive logging. I am sorry I cannot come up
> >with something else in the mean time.
> 
> 
> Ok then. I will, meanwhile, try to contact Andrea Righi (author of
> cgroup-task etc.) and ask him to send here his opinion about relation
> between freezes and his patches.

As I described in the other email, this seems to be a deadlock in the
memcg oom path, so I do not think that the other patches influence it.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25 12:05                 ` Michal Hocko
  2012-11-25 12:36                   ` azurIt
@ 2012-11-25 13:55                   ` Michal Hocko
  2012-11-26  0:38                     ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-25 13:55 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Sun 25-11-12 13:05:24, Michal Hocko wrote:
> [Adding Kamezawa into CC]
> 
> On Sun 25-11-12 01:10:47, azurIt wrote:
> > >Could you take few snapshots over time?
> > 
> > 
> > Here it is, now from different server, snapshot was taken every second
> > for 10 minutes (hope it's enough):
> > www.watchdog.sk/lkml/memcg-bug-2.tar.gz
> 
> Hmm, interesting:
> $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
> min:16281 max:224048 avg:18818.943119
> 
> So there is a lot of attempts to allocate which fail, every second!
> Will get to that later.
> 
> The number of tasks in the group is stable (20):
> $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
>     546 20
> 
> And no task has been killed or spawned:
> $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
> 24495
> 24762
> 24774
> 24796
> 24798
> 24805
> 24813
> 24827
> 24831
> 24841
> 24842
> 24863
> 24892
> 24924
> 24931
> 25130
> 25131
> 25192
> 25193
> 25243
> 
> $ for stack in [0-9]*/[0-9]*
> do 
> 	head -n1 $stack/stack
> done | sort | uniq -c
>    9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>     546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
>     533 [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Tells us that the stacks are pretty much stable.
> $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
>     546 24495
> 
> So 24495 is stuck in do_truncate
> [<ffffffff811109b8>] do_truncate+0x58/0xa0
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> I suspect it is waiting for i_mutex. Who is holding that lock?
> Other tasks are blocked on the mem_cgroup_handle_oom either coming from
> the page fault path so i_mutex can be exluded or vfs_write (24796) and
> that one is interesting:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0		# takes &inode->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This smells like a deadlock. But a strange one. The rapidly
> increasing failcnt suggests that somebody still tries to allocate, but
> who, when all of them hang in mem_cgroup_handle_oom? This can be
> explained though.
> The memcg OOM killer lets only one process (the one which is able to lock
> the hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and
> kill a process, while the others are waiting on the wait queue. Once the
> killer is done it calls memcg_wakeup_oom which wakes up the other tasks
> waiting on the queue. Those retry the charge, in the hope that some memory
> was freed in the meantime. That hasn't happened, so they get into OOM
> again (and again and again).
> This all usually works out except in this particular case I would bet
> my hat that the OOM selected task is pid 24495 which is blocked on the
> mutex which is held by one of the oom killer task so it cannot finish -
> thus free a memory.
> 
> It seems that the current Linus' tree is affected as well.
> 
> I will have to think about a solution but it sounds really tricky. It is
> not just ext3 that is affected.
> 
> I guess we need to tell mem_cgroup_cache_charge that it should never
> reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
> me. On the other hand it is really weird that an excessive writer might
> trigger a memcg OOM killer.

This is hackish but it should help you in this case. Kamezawa, what do
you think about that? Should we generalize this and prepare something
like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
automatically and use the function whenever we are in a locked context?
To be honest I do not like this very much but nothing more sensible
(without touching non-memcg paths) comes to my mind.
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..da50c83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+					(gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-25 13:55                   ` Michal Hocko
@ 2012-11-26  0:38                     ` azurIt
  2012-11-26  7:57                       ` Michal Hocko
  2012-11-26 13:18                       ` [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: azurIt @ 2012-11-26  0:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

>This is hackish but it should help you in this case. Kamezawa, what do
>you think about that? Should we generalize this and prepare something
>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
>automatically and use the function whenever we are in a locked context?
>To be honest I do not like this very much but nothing more sensible
>(without touching non-memcg paths) comes to my mind.


I installed the kernel with this patch and will report back if the problem occurs again, or in a few weeks if everything is OK. Thank you!

Btw, will this patch be backported to 3.2?

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: memory-cgroup bug
  2012-11-26  0:38                     ` azurIt
@ 2012-11-26  7:57                       ` Michal Hocko
  2012-11-26 13:18                       ` [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Michal Hocko
  1 sibling, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-26  7:57 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
> 
> 
> I installed kernel with this patch, will report back if problem occurs
> again OR in few weeks if everything will be ok. Thank you!

Thanks!

> Btw, will this patch be backported to 3.2?

Once we agree on a proper solution it will be backported to the stable
trees.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26  0:38                     ` azurIt
  2012-11-26  7:57                       ` Michal Hocko
@ 2012-11-26 13:18                       ` Michal Hocko
  2012-11-26 13:21                         ` [PATCH for 3.2.34] " Michal Hocko
                                           ` (2 more replies)
  1 sibling, 3 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-26 13:18 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

[CCing also Johannes - the thread started here:
https://lkml.org/lkml/2012/11/21/497]

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
> 
> 
> I installed kernel with this patch, will report back if problem occurs
> again OR in few weeks if everything will be ok. Thank you!

Now that I am looking at the patch closer it will not work, because it
depends on another patch which is not merged yet, and even that one
wouldn't help on its own because __GFP_NORETRY doesn't break the charge
loop. Sorry, I have missed that...

The patch below should help though. (It is based on top of the current
-mm tree, but I will send a backport to 3.2 in the reply as well.)
---
>From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask
and so tells mem_cgroup_charge_common that OOM is not allowed for the
charge. Forbidding OOM from this path, besides fixing the bug, also
makes some sense, as we really do not want to cause an OOM because of
page cache usage.
As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg-specific flag because it has
already been used to prevent OOM (since the not-merged-yet "memcg:
reclaim when more than one page needed"). The only difference is that
the flag no longer prevents reclaim, which makes sense because the
global memory allocator triggers reclaim as well. Retrying without any
reclaim on __GFP_NORETRY didn't make much sense anyway, because it was
effectively a busy loop with OOM still allowed on this path.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   12 ++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |    5 +----
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 10e667f..aac9b21 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -152,6 +152,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..1ad4bc6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
@@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef14351 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..b4754ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
-	if (gfp_mask & __GFP_NORETRY)
-		return CHARGE_NOMEM;
-
 	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
@@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 13:18                       ` [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Michal Hocko
@ 2012-11-26 13:21                         ` Michal Hocko
  2012-11-26 21:28                           ` azurIt
                                             ` (2 more replies)
  2012-11-26 17:46                         ` [PATCH -mm] " Johannes Weiner
  2012-11-27  0:05                         ` Kamezawa Hiroyuki
  2 siblings, 3 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-26 13:21 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Here we go with the patch for 3.2.34. Could you test with this one,
please?
---
>From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask
and so tells mem_cgroup_charge_common that OOM is not allowed for the
charge. Forbidding OOM from this path, besides fixing the bug, also
makes some sense, as we really do not want to cause an OOM because of
page cache usage.
As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg-specific flag because no
user-accounted allocation uses this flag, except for THP, which has
memcg oom disabled already.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   13 +++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |    2 +-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..1dbbe7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2703,7 +2703,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = true;
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 13:18                       ` [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Michal Hocko
  2012-11-26 13:21                         ` [PATCH for 3.2.34] " Michal Hocko
@ 2012-11-26 17:46                         ` Johannes Weiner
  2012-11-26 18:04                           ` Michal Hocko
  2012-11-27  0:05                         ` Kamezawa Hiroyuki
  2 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2012-11-26 17:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> [CCing also Johannes - the thread started here:
> https://lkml.org/lkml/2012/11/21/497]
> 
> On Mon 26-11-12 01:38:55, azurIt wrote:
> > >This is hackish but it should help you in this case. Kamezawa, what do
> > >you think about that? Should we generalize this and prepare something
> > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> > >automatically and use the function whenever we are in a locked context?
> > >To be honest I do not like this very much but nothing more sensible
> > >(without touching non-memcg paths) comes to my mind.
> > 
> > 
> > I installed kernel with this patch, will report back if problem occurs
> > again OR in few weeks if everything will be ok. Thank you!
> 
> Now that I am looking at the patch closer it will not work because it
> depends on other patch which is not merged yet and even that one would
> help on its own because __GFP_NORETRY doesn't break the charge loop.
> Sorry I have missed that...
> 
> The patch bellow should help though. (it is based on top of the current
> -mm tree but I will send a backport to 3.2 in the reply as well)
> ---
> >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 26 Nov 2012 11:47:57 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> 
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents other task to
> terminate because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncate it).
> 
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff

So process B manages to lock the hierarchy, calls
mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
for task A to die.  All while it holds the i_mutex, preventing task A
from dying, right?

I think global oom already handles this in a much better way: invoke
the OOM killer, sleep for a second, then return to userspace to
relinquish all kernel resources and locks.  The only reason why we
can't simply change from an endless retry loop is because we don't
want to return VM_FAULT_OOM and invoke the global OOM killer.  But
maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and just
restart the pagefault.  Return -ENOMEM to the buffered IO syscall
respectively.  This way, the memcg OOM killer is invoked as it should
but nobody gets stuck anywhere livelocking with the exiting task.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 17:46                         ` [PATCH -mm] " Johannes Weiner
@ 2012-11-26 18:04                           ` Michal Hocko
  2012-11-26 18:24                             ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-26 18:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> > [CCing also Johannes - the thread started here:
> > https://lkml.org/lkml/2012/11/21/497]
> > 
> > On Mon 26-11-12 01:38:55, azurIt wrote:
> > > >This is hackish but it should help you in this case. Kamezawa, what do
> > > >you think about that? Should we generalize this and prepare something
> > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> > > >automatically and use the function whenever we are in a locked context?
> > > >To be honest I do not like this very much but nothing more sensible
> > > >(without touching non-memcg paths) comes to my mind.
> > > 
> > > 
> > > I installed kernel with this patch, will report back if problem occurs
> > > again OR in few weeks if everything will be ok. Thank you!
> > 
> > Now that I am looking at the patch closer it will not work because it
> > depends on other patch which is not merged yet and even that one would
> > help on its own because __GFP_NORETRY doesn't break the charge loop.
> > Sorry I have missed that...
> > 
> > The patch bellow should help though. (it is based on top of the current
> > -mm tree but I will send a backport to 3.2 in the reply as well)
> > ---
> > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> > 
> > memcg oom killer might deadlock if the process which falls down to
> > mem_cgroup_handle_oom holds a lock which prevents other task to
> > terminate because it is blocked on the very same lock.
> > This can happen when a write system call needs to allocate a page but
> > the allocation hits the memcg hard limit and there is nothing to reclaim
> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> > have been reclaimed already) and the process selected by memcg OOM
> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> > 
> > Process A
> > [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> > [<ffffffff81121c90>] do_last+0x250/0xa30
> > [<ffffffff81122547>] path_openat+0xd7/0x440
> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> > [<ffffffff8110f950>] sys_open+0x20/0x30
> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > Process B
> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> > [<ffffffff8111156a>] do_sync_write+0xea/0x130
> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> > [<ffffffff81112381>] sys_write+0x51/0x90
> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > [<ffffffffffffffff>] 0xffffffffffffffff
> 
> So process B manages to lock the hierarchy, calls
> mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
> for task A to die.  All while it holds the i_mutex, preventing task A
> from dying, right?

Right.

> I think global oom already handles this in a much better way: invoke
> the OOM killer, sleep for a second, then return to userspace to
> relinquish all kernel resources and locks.  The only reason why we
> can't simply change from an endless retry loop is because we don't
> want to return VM_FAULT_OOM and invoke the global OOM killer.

Exactly.

> But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> respectively.  This way, the memcg OOM killer is invoked as it should
> but nobody gets stuck anywhere livelocking with the exiting task.

Hmm, we would still have a problem with oom disabled (aka user space OOM
killer), right? Every process except those blocked in
mem_cgroup_handle_oom is at risk of being killed.
Another point of view might be: why should we trigger the OOM killer
from those paths in the first place? Write or read (or even readahead)
are all calls that should rather fail than cause an OOM kill, in my
opinion.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 18:04                           ` Michal Hocko
@ 2012-11-26 18:24                             ` Johannes Weiner
  2012-11-26 19:03                               ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2012-11-26 18:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> > > [CCing also Johannes - the thread started here:
> > > https://lkml.org/lkml/2012/11/21/497]
> > > 
> > > On Mon 26-11-12 01:38:55, azurIt wrote:
> > > > >This is hackish but it should help you in this case. Kamezawa, what do
> > > > >you think about that? Should we generalize this and prepare something
> > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> > > > >automatically and use the function whenever we are in a locked context?
> > > > >To be honest I do not like this very much but nothing more sensible
> > > > >(without touching non-memcg paths) comes to my mind.
> > > > 
> > > > 
> > > > I installed kernel with this patch, will report back if problem occurs
> > > > again OR in few weeks if everything will be ok. Thank you!
> > > 
> > > Now that I am looking at the patch closer it will not work because it
> > > depends on other patch which is not merged yet and even that one would
> > > help on its own because __GFP_NORETRY doesn't break the charge loop.
> > > Sorry I have missed that...
> > > 
> > > The patch bellow should help though. (it is based on top of the current
> > > -mm tree but I will send a backport to 3.2 in the reply as well)
> > > ---
> > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.cz>
> > > Date: Mon, 26 Nov 2012 11:47:57 +0100
> > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> > > 
> > > memcg oom killer might deadlock if the process which falls down to
> > > mem_cgroup_handle_oom holds a lock which prevents other task to
> > > terminate because it is blocked on the very same lock.
> > > This can happen when a write system call needs to allocate a page but
> > > the allocation hits the memcg hard limit and there is nothing to reclaim
> > > (e.g. there is no swap or swap limit is hit as well and all cache pages
> > > have been reclaimed already) and the process selected by memcg OOM
> > > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> > > 
> > > Process A
> > > [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> > > [<ffffffff81121c90>] do_last+0x250/0xa30
> > > [<ffffffff81122547>] path_openat+0xd7/0x440
> > > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> > > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> > > [<ffffffff8110f950>] sys_open+0x20/0x30
> > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > > 
> > > Process B
> > > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> > > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> > > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> > > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> > > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> > > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> > > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> > > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> > > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> > > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> > > [<ffffffff8111156a>] do_sync_write+0xea/0x130
> > > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> > > [<ffffffff81112381>] sys_write+0x51/0x90
> > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > So process B manages to lock the hierarchy, calls
> > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
> > for task A to die.  All while it holds the i_mutex, preventing task A
> > from dying, right?
> 
> Right.
> 
> > I think global oom already handles this in a much better way: invoke
> > the OOM killer, sleep for a second, then return to userspace to
> > relinquish all kernel resources and locks.  The only reason why we
> > can't simply change from an endless retry loop is because we don't
> > want to return VM_FAULT_OOM and invoke the global OOM killer.
> 
> Exactly.
> 
> > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> > respectively.  This way, the memcg OOM killer is invoked as it should
> > but nobody gets stuck anywhere livelocking with the exiting task.
> 
> Hmm, we would still have a problem with oom disabled (aka user space OOM
> killer), right? All processes but those in mem_cgroup_handle_oom are
> risky to be killed.

Could we still let everybody get stuck in there when the OOM killer is
disabled and let userspace take care of it?

> Other POV might be, why we should trigger an OOM killer from those paths
> in the first place. Write or read (or even readahead) are all calls that
> should rather fail than cause an OOM killer in my opinion.

Readahead is arguable, but we kill globally for read() and write() and
I think we should do the same for memcg.

The OOM killer is there to resolve a problem that comes from
overcommitting the machine but the overuse does not have to be from
the application that pushes the machine over the edge, that's why we
don't just kill the allocating task but actually go look for the best
candidate.  If you have one memory hog that overuses the resources,
attempted memory consumption in a different program should invoke the
OOM killer.  It does not matter if this is a page fault (would still
happen with your patch) or a buffered read/write (would no longer
happen).

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 18:24                             ` Johannes Weiner
@ 2012-11-26 19:03                               ` Michal Hocko
  2012-11-26 19:29                                 ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-26 19:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
[...]
> > > I think global oom already handles this in a much better way: invoke
> > > the OOM killer, sleep for a second, then return to userspace to
> > > relinquish all kernel resources and locks.  The only reason why we
> > > can't simply change from an endless retry loop is because we don't
> > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> > 
> > Exactly.
> > 
> > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> > > respectively.  This way, the memcg OOM killer is invoked as it should
> > > but nobody gets stuck anywhere livelocking with the exiting task.
> > 
> > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > killer), right? All processes but those in mem_cgroup_handle_oom are
> > risky to be killed.
> 
> Could we still let everybody get stuck in there when the OOM killer is
> disabled and let userspace take care of it?

I am not sure what exactly you mean by "userspace take care of it" but
if those processes are stuck and holding the lock then it is usually
hard to find that out. Well, if somebody is familiar with the internals then
it is doable but this makes the interface really unusable for regular
usage.

> > Other POV might be, why we should trigger an OOM killer from those paths
> > in the first place. Write or read (or even readahead) are all calls that
> > should rather fail than cause an OOM killer in my opinion.
> 
> Readahead is arguable, but we kill globally for read() and write() and
> I think we should do the same for memcg.

Fair point, but the global case is a little easier than memcg here
because nobody can hook into the global OOM killer and provide a
userspace implementation for it, which is one of the cooler features of
memcg... I am open to any suggestions but we should somehow fix this
(and backport it to stable trees as the problem has been there for quite
some time; the current report shows that it is not that hard to trigger).

> The OOM killer is there to resolve a problem that comes from
> overcommitting the machine but the overuse does not have to be from
> the application that pushes the machine over the edge, that's why we
> don't just kill the allocating task but actually go look for the best
> candidate.  If you have one memory hog that overuses the resources,
> attempted memory consumption in a different program should invoke the
> OOM killer.  

> It does not matter if this is a page fault (would still happen with
> your patch) or a buffered read/write (would no longer happen).

True, and it is sad that mmap then behaves slightly differently from
read/write, which I should have mentioned in the changelog. As I said, I
am open to other suggestions.

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 19:03                               ` Michal Hocko
@ 2012-11-26 19:29                                 ` Johannes Weiner
  2012-11-26 20:08                                   ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2012-11-26 19:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote:
> On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> [...]
> > > > I think global oom already handles this in a much better way: invoke
> > > > the OOM killer, sleep for a second, then return to userspace to
> > > > relinquish all kernel resources and locks.  The only reason why we
> > > > can't simply change from an endless retry loop is because we don't
> > > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> > > 
> > > Exactly.
> > > 
> > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > > just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> > > > respectively.  This way, the memcg OOM killer is invoked as it should
> > > > but nobody gets stuck anywhere livelocking with the exiting task.
> > > 
> > > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > > killer), right? All processes but those in mem_cgroup_handle_oom are
> > > risky to be killed.
> > 
> > Could we still let everybody get stuck in there when the OOM killer is
> > disabled and let userspace take care of it?
> 
> I am not sure what exactly you mean by "userspace take care of it" but
> if those processes are stuck and holding the lock then it is usually
> hard to find that out. Well, if somebody is familiar with the internals then
> it is doable but this makes the interface really unusable for regular
> usage.

If oom_kill_disable is set, then all processes get stuck all the way
down in the charge stack.  Whatever resource they pin, you may
deadlock on if you try to touch it while handling the problem from
userspace.  I don't see how this is a new problem...?  Or do you mean
something else?

> > > Other POV might be, why we should trigger an OOM killer from those paths
> > > in the first place. Write or read (or even readahead) are all calls that
> > > should rather fail than cause an OOM killer in my opinion.
> > 
> > Readahead is arguable, but we kill globally for read() and write() and
> > I think we should do the same for memcg.
> 
> Fair point, but the global case is a little easier than memcg here
> because nobody can hook into the global OOM killer and provide a
> userspace implementation for it, which is one of the cooler features of
> memcg... I am open to any suggestions but we should somehow fix this
> (and backport it to stable trees as the problem has been there for quite
> some time; the current report shows that it is not that hard to trigger).

As per above, the userspace OOM handling is risky as hell anyway.
What happens when an anonymous fault waits in memcg userspace OOM
while holding the mmap_sem, and a writer lines up behind it?  Your
userspace OOM handler had better not look at any of the /proc files of
the stuck task that require the mmap_sem.

By the same token, it probably shouldn't touch the same files a memcg
task is stuck trying to read/write.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 19:29                                 ` Johannes Weiner
@ 2012-11-26 20:08                                   ` Michal Hocko
  2012-11-26 20:19                                     ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-26 20:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon 26-11-12 14:29:41, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote:
> > On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > [...]
> > > > > I think global oom already handles this in a much better way: invoke
> > > > > the OOM killer, sleep for a second, then return to userspace to
> > > > > relinquish all kernel resources and locks.  The only reason why we
> > > > > can't simply change from an endless retry loop is because we don't
> > > > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> > > > 
> > > > Exactly.
> > > > 
> > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > > > just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> > > > > respectively.  This way, the memcg OOM killer is invoked as it should
> > > > > but nobody gets stuck anywhere livelocking with the exiting task.
> > > > 
> > > > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > > > killer), right? All processes but those in mem_cgroup_handle_oom are
> > > > risky to be killed.
> > > 
> > > Could we still let everybody get stuck in there when the OOM killer is
> > > disabled and let userspace take care of it?
> > 
> > I am not sure what exactly you mean by "userspace take care of it" but
> > if those processes are stuck and holding the lock then it is usually
> > hard to find that out. Well, if somebody is familiar with the internals then
> > it is doable but this makes the interface really unusable for regular
> > usage.
> 
> If oom_kill_disable is set, then all processes get stuck all the way
> down in the charge stack.  Whatever resource they pin, you may
> deadlock on if you try to touch it while handling the problem from
> userspace.

OK, I guess I am getting what you are trying to say. So what you are
suggesting is to just let mem_cgroup_out_of_memory send the signal and
move on without retry (or with few charge retries without further OOM
killing) and fail the charge with your new FAULT_OOM_HANDLED (resp.
something like FAULT_RETRY) error code resp. ENOMEM depending on the
caller.  OOM disabled case would be "you are on your own" because this
has been dangerous anyway. Correct?
I do agree that the current endless retry loop is far from being ideal
and can see some updates but I am quite nervous about any potential
regressions in this area (e.g. too aggressive OOM etc...). I have to
think about it some more.
Anyway if you have some more specific ideas I would be happy to review
patches.

[...]

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 20:08                                   ` Michal Hocko
@ 2012-11-26 20:19                                     ` Johannes Weiner
  2012-11-26 20:46                                       ` azurIt
  2012-11-26 22:06                                       ` Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: Johannes Weiner @ 2012-11-26 20:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote:
> On Mon 26-11-12 14:29:41, Johannes Weiner wrote:
> > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote:
> > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote:
> > > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote:
> > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote:
> > > [...]
> > > > > > I think global oom already handles this in a much better way: invoke
> > > > > > the OOM killer, sleep for a second, then return to userspace to
> > > > > > relinquish all kernel resources and locks.  The only reason why we
> > > > > > can't simply change from an endless retry loop is because we don't
> > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer.
> > > > > 
> > > > > Exactly.
> > > > > 
> > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and
> > > > > > just restart the pagefault.  Return -ENOMEM to the buffered IO syscall
> > > > > > respectively.  This way, the memcg OOM killer is invoked as it should
> > > > > > but nobody gets stuck anywhere livelocking with the exiting task.
> > > > > 
> > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM
> > > > > killer), right? All processes but those in mem_cgroup_handle_oom are
> > > > > risky to be killed.
> > > > 
> > > > Could we still let everybody get stuck in there when the OOM killer is
> > > > disabled and let userspace take care of it?
> > > 
> > > I am not sure what exactly you mean by "userspace take care of it" but
> > > if those processes are stuck and holding the lock then it is usually
> > > hard to find that out. Well, if somebody is familiar with the internals then
> > > it is doable but this makes the interface really unusable for regular
> > > usage.
> > 
> > If oom_kill_disable is set, then all processes get stuck all the way
> > down in the charge stack.  Whatever resource they pin, you may
> > deadlock on if you try to touch it while handling the problem from
> > userspace.
> 
> OK, I guess I am getting what you are trying to say. So what you are
> suggesting is to just let mem_cgroup_out_of_memory send the signal and
> move on without retry (or with few charge retries without further OOM
> killing) and fail the charge with your new FAULT_OOM_HANDLED (resp.
> something like FAULT_RETRY) error code resp. ENOMEM depending on the
> caller.  OOM disabled case would be "you are on your own" because this
> has been dangerous anyway. Correct?

Yes.

> I do agree that the current endless retry loop is far from being ideal
> and can see some updates but I am quite nervous about any potential
> regressions in this area (e.g. too aggressive OOM etc...). I have to
> think about it some more.

Agreed on all points.  Maybe we can keep a couple of the oom retry
iterations or something like that, which is still much more than what
global does and I don't think the global OOM killer is overly eager.

Testing will show more.

> Anyway if you have some more specific ideas I would be happy to review
> patches.

Okay, I just wanted to check back with you before going down this
path.  What are we going to do short term, though?  Do you want to
push the disable-oom-for-pagecache for now or should we put the
VM_FAULT_OOM_HANDLED fix in the next version and do stable backports?

This issue has been around for a while so frankly I don't think it's
urgent enough to rush things.


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 20:19                                     ` Johannes Weiner
@ 2012-11-26 20:46                                       ` azurIt
  2012-11-26 20:53                                         ` Johannes Weiner
  2012-11-26 22:06                                       ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-26 20:46 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

>This issue has been around for a while so frankly I don't think it's
>urgent enough to rush things.


Well, it's quite urgent at least for us :( I haven't reported this so far because I wasn't sure it's a kernel thing. I will be really happy and thankful if a fix for this can go to 3.2 in the near future.. Thank you very much!

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 20:46                                       ` azurIt
@ 2012-11-26 20:53                                         ` Johannes Weiner
  0 siblings, 0 replies; 172+ messages in thread
From: Johannes Weiner @ 2012-11-26 20:53 UTC (permalink / raw)
  To: azurIt
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

On Mon, Nov 26, 2012 at 09:46:38PM +0100, azurIt wrote:
> >This issue has been around for a while so frankly I don't think it's
> >urgent enough to rush things.
> 
> 
> Well, it's quite urgent at least for us :( I haven't reported this so
> far because I wasn't sure it's a kernel thing. I will be really happy
> and thankful if a fix for this can go to 3.2 in the near
> future.. Thank you very much!

I understand and of course it's important that we get it fixed as soon
as possible.  All I meant was that this problem has not exactly been
introduced in 3.7 and the fix is non-trivial so we should not be
rushing a change like this into 3.7 just days before its release.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 13:21                         ` [PATCH for 3.2.34] " Michal Hocko
@ 2012-11-26 21:28                           ` azurIt
  2012-11-30  1:45                           ` azurIt
  2012-11-30  2:29                           ` azurIt
  2 siblings, 0 replies; 172+ messages in thread
From: azurIt @ 2012-11-26 21:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Here we go with the patch for 3.2.34. Could you test with this one,
>please?

Michal, regarding your conversation with Johannes Weiner, should I try this patch or not?

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 20:19                                     ` Johannes Weiner
  2012-11-26 20:46                                       ` azurIt
@ 2012-11-26 22:06                                       ` Michal Hocko
  1 sibling, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-26 22:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon 26-11-12 15:19:18, Johannes Weiner wrote:
> On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote:
[...]
> > OK, I guess I am getting what you are trying to say. So what you are
> > suggesting is to just let mem_cgroup_out_of_memory send the signal and
> > move on without retry (or with few charge retries without further OOM
> > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp.
> > something like FAULT_RETRY) error code resp. ENOMEM depending on the
> > caller.  OOM disabled case would be "you are on your own" because this
> > has been dangerous anyway. Correct?
> 
> Yes.
> 
> > I do agree that the current endless retry loop is far from being ideal
> > and can see some updates but I am quite nervous about any potential
> > regressions in this area (e.g. too aggressive OOM etc...). I have to
> > think about it some more.
> 
> Agreed on all points.  Maybe we can keep a couple of the oom retry
> iterations or something like that, which is still much more than what
> global does and I don't think the global OOM killer is overly eager.

Yes, we can offer less blood and more comfort.

> 
> Testing will show more.
> 
> > Anyway if you have some more specific ideas I would be happy to review
> > patches.
> 
> Okay, I just wanted to check back with you before going down this
> path.  What are we going to do short term, though?  Do you want to
> push the disable-oom-for-pagecache for now or should we put the
> VM_FAULT_OOM_HANDLED fix in the next version and do stable backports?
> 
> This issue has been around for a while so frankly I don't think it's
> urgent enough to rush things.

Yes, but now we have a real use case where this hurts AFAIU. Unless
we come up with a fix/reasonable workaround I would rather go with
something simpler for starters and more sophisticated later.

I have to double check other places where we do charging, but the last
time I checked we don't hold page locks on already visible pages (we do
precharge in __do_fault, for example), mem_map for reading in the page
fault path is also safe (with oom enabled) and I guess that tmpfs is ok
as well. Then there is the page cache, which should be covered by my
patch. So we should be covered.

But I like your idea long term.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 13:18                       ` [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Michal Hocko
  2012-11-26 13:21                         ` [PATCH for 3.2.34] " Michal Hocko
  2012-11-26 17:46                         ` [PATCH -mm] " Johannes Weiner
@ 2012-11-27  0:05                         ` Kamezawa Hiroyuki
  2012-11-27  9:54                           ` Michal Hocko
  2012-11-27 19:48                           ` Johannes Weiner
  2 siblings, 2 replies; 172+ messages in thread
From: Kamezawa Hiroyuki @ 2012-11-27  0:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner

(2012/11/26 22:18), Michal Hocko wrote:
> [CCing also Johannes - the thread started here:
> https://lkml.org/lkml/2012/11/21/497]
>
> On Mon 26-11-12 01:38:55, azurIt wrote:
>>> This is hackish but it should help you in this case. Kamezawa, what do
>>> you think about that? Should we generalize this and prepare something
>>> like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
>>> automatically and use the function whenever we are in a locked context?
>>> To be honest I do not like this very much but nothing more sensible
>>> (without touching non-memcg paths) comes to my mind.
>>
>>
>> I installed kernel with this patch, will report back if problem occurs
>> again OR in few weeks if everything will be ok. Thank you!
>
> Now that I am looking at the patch closer it will not work because it
> depends on another patch which is not merged yet, and even that one
> would not help on its own because __GFP_NORETRY doesn't break the
> charge loop. Sorry, I have missed that...
>
> The patch below should help though. (it is based on top of the current
> -mm tree but I will send a backport to 3.2 in the reply as well)
> ---
>  From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 26 Nov 2012 11:47:57 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>
> > memcg oom killer might deadlock if the process which falls down to
> > mem_cgroup_handle_oom holds a lock which prevents another task from
> > terminating because that task is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> > killer is blocked on i_mutex on the same inode (e.g. truncating it).
>
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> > This is not a hard deadlock though because the administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
>
> > This patch heals the problem by forbidding OOM from page cache charges
> > (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
> > helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
> > which then tells mem_cgroup_charge_common that OOM is not allowed for
> > the charge. Forbidding OOM from this path, besides fixing the bug, also
> > makes some sense as we really do not want to cause an OOM because of
> > page cache usage.
> > As a possibly visible result add_to_page_cache_lru might fail more often
> > with ENOMEM but this is to be expected if the limit is set and it is
> > preferable to the OOM killer IMO.
>
> > __GFP_NORETRY is abused for this memcg specific flag because it has been
> > used to prevent OOM already (since the not-merged-yet "memcg: reclaim
> > when more than one page needed"). The only difference is that the flag
> > doesn't prevent reclaim anymore, which kind of makes sense because the
> > global memory allocator triggers reclaim as well. Retrying without any
> > reclaim on __GFP_NORETRY doesn't make much sense anyway because it is
> > effectively a busy loop with OOM allowed in this path.
>
> Reported-by: azurIt <azurit@pobox.sk>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

As a short term fix, I think this patch will work well enough and it seems simple enough.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reading the discussion between you and Johannes, I understand that to
release the locks the memcg needs to return "RETRY" as a long term fix.
Thinking a little, it would be simple to return "RETRY" to all processes
waiting on the oom kill queue of a memcg, and it can be done by small
fixes to memory.c.

Thank you.
-Kame

> ---
>   include/linux/gfp.h        |    3 +++
>   include/linux/memcontrol.h |   12 ++++++++++++
>   mm/filemap.c               |    8 +++++++-
>   mm/memcontrol.c            |    5 +----
>   4 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 10e667f..aac9b21 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -152,6 +152,9 @@ struct vm_area_struct;
>   /* 4GB DMA on some platforms */
>   #define GFP_DMA32	__GFP_DMA32
>
> +/* memcg oom killer is not allowed */
> +#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
> +
>   /* Convert GFP flags to their corresponding migrate type */
>   static inline int allocflags_to_migratetype(gfp_t gfp_flags)
>   {
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..1ad4bc6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>   extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>   					gfp_t gfp_mask);
>
> +static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
> +					struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
> +}
> +
>   struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>   struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>
> @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
>   	return 0;
>   }
>
> +static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
> +					struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	return 0;
> +}
> +
>   static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>   		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
>   {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..ef14351 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>   	VM_BUG_ON(!PageLocked(page));
>   	VM_BUG_ON(PageSwapBacked(page));
>
> -	error = mem_cgroup_cache_charge(page, current->mm,
> +	/*
> +	 * Cannot trigger OOM even if gfp_mask would allow that normally
> +	 * because we might be called from a locked context and that
> +	 * could lead to deadlocks if the killed process is waiting for
> +	 * the same lock.
> +	 */
> +	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
>   					gfp_mask & GFP_RECLAIM_MASK);
>   	if (error)
>   		goto out;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 02ee2f7..b4754ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   	if (!(gfp_mask & __GFP_WAIT))
>   		return CHARGE_WOULDBLOCK;
>
> -	if (gfp_mask & __GFP_NORETRY)
> -		return CHARGE_NOMEM;
> -
>   	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
>   	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>   		return CHARGE_RETRY;
> @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
>   {
>   	struct mem_cgroup *memcg = NULL;
>   	unsigned int nr_pages = 1;
> -	bool oom = true;
> +	bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
>   	int ret;
>
>   	if (PageTransHuge(page)) {
>



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-27  0:05                         ` Kamezawa Hiroyuki
@ 2012-11-27  9:54                           ` Michal Hocko
  2012-11-27 19:48                           ` Johannes Weiner
  1 sibling, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-27  9:54 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner

On Tue 27-11-12 09:05:30, KAMEZAWA Hiroyuki wrote:
[...]
> As a short term fix, I think this patch will work enough and seems simple enough.
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks!
If Johannes is also ok with this for now I will resubmit the patch to
Andrew after I hear back from the reporter.
 
> Reading discussion between you and Johannes, to release locks, I understand
> the memcg need to return "RETRY" for a long term fix. Thinking a little,
> it will be simple to return "RETRY" to all processes waited on oom kill queue
> of a memcg and it can be done by a small fixes to memory.c.

I wouldn't call it simple but it is doable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-27  0:05                         ` Kamezawa Hiroyuki
  2012-11-27  9:54                           ` Michal Hocko
@ 2012-11-27 19:48                           ` Johannes Weiner
  2012-11-27 20:54                             ` [PATCH -v2 " Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2012-11-27 19:48 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Tue, Nov 27, 2012 at 09:05:30AM +0900, Kamezawa Hiroyuki wrote:
> (2012/11/26 22:18), Michal Hocko wrote:
> >[CCing also Johannes - the thread started here:
> >https://lkml.org/lkml/2012/11/21/497]
> >
> >On Mon 26-11-12 01:38:55, azurIt wrote:
> >>>This is hackish but it should help you in this case. Kamezawa, what do
> >>>you think about that? Should we generalize this and prepare something
> >>>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >>>automatically and use the function whenever we are in a locked context?
> >>>To be honest I do not like this very much but nothing more sensible
> >>>(without touching non-memcg paths) comes to my mind.
> >>
> >>
> >>I installed kernel with this patch, will report back if problem occurs
> >>again OR in few weeks if everything will be ok. Thank you!
> >
> >Now that I am looking at the patch more closely, it will not work because
> >it depends on another patch which is not merged yet, and even that one
> >wouldn't help on its own because __GFP_NORETRY doesn't break the charge
> >loop. Sorry I have missed that...
> >
> >The patch below should help though. (it is based on top of the current
> >-mm tree but I will send a backport to 3.2 in the reply as well)
> >---
> > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> >From: Michal Hocko <mhocko@suse.cz>
> >Date: Mon, 26 Nov 2012 11:47:57 +0100
> >Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >
> >The memcg oom killer might deadlock if the process which falls down to
> >mem_cgroup_handle_oom holds a lock which prevents another task from
> >terminating because that task is blocked on the very same lock.
> >This can happen when a write system call needs to allocate a page but
> >the allocation hits the memcg hard limit and there is nothing to reclaim
> >(e.g. there is no swap, or the swap limit is hit as well and all cache
> >pages have been reclaimed already) and the process selected by the memcg
> >OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
> >
> >Process A
> >[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> >[<ffffffff81121c90>] do_last+0x250/0xa30
> >[<ffffffff81122547>] path_openat+0xd7/0x440
> >[<ffffffff811229c9>] do_filp_open+0x49/0xa0
> >[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> >[<ffffffff8110f950>] sys_open+0x20/0x30
> >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >[<ffffffffffffffff>] 0xffffffffffffffff
> >
> >Process B
> >[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> >[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> >[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> >[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> >[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> >[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> >[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> >[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> >[<ffffffff8111156a>] do_sync_write+0xea/0x130
> >[<ffffffff81112183>] vfs_write+0xf3/0x1f0
> >[<ffffffff81112381>] sys_write+0x51/0x90
> >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >[<ffffffffffffffff>] 0xffffffffffffffff
> >
> >This is not a hard deadlock though because the administrator can still
> >intervene and increase the limit on the group, which helps the writer to
> >finish the allocation and release the lock.
> >
> >This patch heals the problem by forbidding OOM from page cache charges
> >(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
> >helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
> >which then tells mem_cgroup_charge_common that OOM is not allowed for
> >the charge. Forbidding OOM on this path, besides fixing the bug, also
> >makes some sense as we really do not want to cause an OOM because of
> >page cache usage.
> >As a possibly visible result add_to_page_cache_lru might fail more often
> >with ENOMEM but this is to be expected if the limit is set and it is
> >preferable to the OOM killer IMO.
> >
> >__GFP_NORETRY is abused for this memcg specific flag because it has been
> >used to prevent OOM already (since the not-merged-yet "memcg: reclaim
> >when more than one page needed"). The only difference is that the flag
> >no longer prevents reclaim, which kind of makes sense because the global
> >memory allocator triggers reclaim as well. Retrying without any reclaim
> >on __GFP_NORETRY doesn't make much sense anyway because it is
> >effectively a busy loop with OOM allowed on this path.
> >
> >Reported-by: azurIt <azurit@pobox.sk>
> >Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> As a short term fix, I think this patch will work enough and seems simple enough.
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Yes, let's do this for now.

> >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >index 10e667f..aac9b21 100644
> >--- a/include/linux/gfp.h
> >+++ b/include/linux/gfp.h
> >@@ -152,6 +152,9 @@ struct vm_area_struct;
> >  /* 4GB DMA on some platforms */
> >  #define GFP_DMA32	__GFP_DMA32
> >
> >+/* memcg oom killer is not allowed */
> >+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY

Could we leave this within memcg, please?  An extra flag to
mem_cgroup_cache_charge() or the like.  GFP flags are about
controlling the page allocator; this seems abusive.  We have an oom
flag down in try_charge, maybe just propagate this up the stack?

> >diff --git a/mm/filemap.c b/mm/filemap.c
> >index 83efee7..ef14351 100644
> >--- a/mm/filemap.c
> >+++ b/mm/filemap.c
> >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> >  	VM_BUG_ON(!PageLocked(page));
> >  	VM_BUG_ON(PageSwapBacked(page));
> >
> >-	error = mem_cgroup_cache_charge(page, current->mm,
> >+	/*
> >+	 * Cannot trigger OOM even if gfp_mask would allow that normally
> >+	 * because we might be called from a locked context and that
> >+	 * could lead to deadlocks if the killed process is waiting for
> >+	 * the same lock.
> >+	 */
> >+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> >  					gfp_mask & GFP_RECLAIM_MASK);
> >  	if (error)
> >  		goto out;

Shmem does not use this function but also charges under the i_mutex in
the write path and in fallocate at least.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-27 19:48                           ` Johannes Weiner
@ 2012-11-27 20:54                             ` Michal Hocko
  2012-11-27 20:59                               ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-27 20:54 UTC (permalink / raw)
  To: Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Tue 27-11-12 14:48:13, Johannes Weiner wrote:
[...]
> > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > >index 10e667f..aac9b21 100644
> > >--- a/include/linux/gfp.h
> > >+++ b/include/linux/gfp.h
> > >@@ -152,6 +152,9 @@ struct vm_area_struct;
> > >  /* 4GB DMA on some platforms */
> > >  #define GFP_DMA32	__GFP_DMA32
> > >
> > >+/* memcg oom killer is not allowed */
> > >+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
> 
> Could we leave this within memcg, please?  An extra flag to
> mem_cgroup_cache_charge() or the like.  GFP flags are about
> controlling the page allocator; this seems abusive.  We have an oom
> flag down in try_charge, maybe just propagate this up the stack?

OK, what about the patch below?
I have dropped Kame's Acked-by because it has been reworked. The patch
is the same in principle.

> > >diff --git a/mm/filemap.c b/mm/filemap.c
> > >index 83efee7..ef14351 100644
> > >--- a/mm/filemap.c
> > >+++ b/mm/filemap.c
> > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > >  	VM_BUG_ON(!PageLocked(page));
> > >  	VM_BUG_ON(PageSwapBacked(page));
> > >
> > >-	error = mem_cgroup_cache_charge(page, current->mm,
> > >+	/*
> > >+	 * Cannot trigger OOM even if gfp_mask would allow that normally
> > >+	 * because we might be called from a locked context and that
> > >+	 * could lead to deadlocks if the killed process is waiting for
> > >+	 * the same lock.
> > >+	 */
> > >+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
> > >  					gfp_mask & GFP_RECLAIM_MASK);
> > >  	if (error)
> > >  		goto out;
> 
> Shmem does not use this function but also charges under the i_mutex in
> the write path and in fallocate at least.

Right you are.
---
>From 60cc8a184490d277eb24fca551b114f1e2234ce0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by the memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- handle also shmem write faults resp. fallocate properly as per Johannes

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |    5 +++--
 mm/filemap.c               |    9 +++++++--
 mm/memcontrol.c            |    9 ++++-----
 mm/shmem.c                 |   14 +++++++++++---
 4 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+					gfp_t gfp_mask, bool oom);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
 }
 
 static inline int mem_cgroup_cache_charge(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
+					struct mm_struct *mm, gfp_t gfp_mask,
+					bool oom)
 {
 	return 0;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
 	if (error)
 		goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..26690d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype)
+				gfp_t gfp_mask, enum charge_type ctype, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
 	VM_BUG_ON(page->mapping && !PageAnon(page));
 	VM_BUG_ON(!mm);
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-					MEM_CGROUP_CHARGE_TYPE_ANON);
+					MEM_CGROUP_CHARGE_TYPE_ANON, true);
 }
 
 /*
@@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+				gfp_t gfp_mask, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		return 0;
 
 	if (!PageSwapCache(page))
-		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
 	else { /* page is swapcache/shmem */
 		ret = __mem_cgroup_try_charge_swapin(mm, page,
 						     gfp_mask, &memcg);
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..cef63b5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,16 @@ repeat:
 				goto failed;
 		}
 
+		 /*
+                  * Cannot trigger OOM even if gfp_mask would allow that
+                  * normally because we might be called from a locked
+                  * context (i_mutex held) if this is a write lock or
+                  * fallocate and that could lead to deadlocks if the
+                  * killed process is waiting for the same lock.
+		  */
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp < SGP_WRITE);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1217,7 @@ repeat:
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK, true);
 		if (error)
 			goto decused;
 		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
-- 
1.7.10.4


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-27 20:54                             ` [PATCH -v2 " Michal Hocko
@ 2012-11-27 20:59                               ` Michal Hocko
  2012-11-28 15:26                                 ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-27 20:59 UTC (permalink / raw)
  To: Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist

Sorry, I forgot about one shmem charge:
---
>From 7ae29927d24471c1b1a6ceb021219c592c1ef518 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 27 Nov 2012 21:53:13 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by the memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- handle also shmem write faults resp. fallocate properly as per Johannes

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |    5 +++--
 mm/filemap.c               |    9 +++++++--
 mm/memcontrol.c            |    9 ++++-----
 mm/shmem.c                 |   15 ++++++++++++---
 4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+					gfp_t gfp_mask, bool oom);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
 }
 
 static inline int mem_cgroup_cache_charge(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
+					struct mm_struct *mm, gfp_t gfp_mask,
+					bool oom)
 {
 	return 0;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
 	if (error)
 		goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..26690d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype)
+				gfp_t gfp_mask, enum charge_type ctype, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
 	VM_BUG_ON(page->mapping && !PageAnon(page));
 	VM_BUG_ON(!mm);
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-					MEM_CGROUP_CHARGE_TYPE_ANON);
+					MEM_CGROUP_CHARGE_TYPE_ANON, true);
 }
 
 /*
@@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+				gfp_t gfp_mask, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		return 0;
 
 	if (!PageSwapCache(page))
-		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
 	else { /* page is swapcache/shmem */
 		ret = __mem_cgroup_try_charge_swapin(mm, page,
 						     gfp_mask, &memcg);
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..ba59cfa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,16 @@ repeat:
 				goto failed;
 		}
 
+		 /*
+                  * Cannot trigger OOM even if gfp_mask would allow that
+                  * normally because we might be called from a locked
+                  * context (i_mutex held) if this is a write lock or
+                  * fallocate and that could lead to deadlocks if the
+                  * killed process is waiting for the same lock.
+		  */
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp < SGP_WRITE);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1217,8 @@ repeat:
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp < SGP_WRITE);
 		if (error)
 			goto decused;
 		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-27 20:59                               ` Michal Hocko
@ 2012-11-28 15:26                                 ` Johannes Weiner
  2012-11-28 16:04                                   ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2012-11-28 15:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote:
> @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		return 0;
>  
>  	if (!PageSwapCache(page))
> -		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> +		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
>  	else { /* page is swapcache/shmem */
>  		ret = __mem_cgroup_try_charge_swapin(mm, page,
>  						     gfp_mask, &memcg);

I think you need to pass it down the swapcache path too, as that is
what happens when the shmem page written to is in swap and has been
read into swapcache by the time of charging.

> @@ -1152,8 +1152,16 @@ repeat:
>  				goto failed;
>  		}
>  
> +		 /*
> +                  * Cannot trigger OOM even if gfp_mask would allow that
> +                  * normally because we might be called from a locked
> +                  * context (i_mutex held) if this is a write lock or
> +                  * fallocate and that could lead to deadlocks if the
> +                  * killed process is waiting for the same lock.
> +		  */

Indentation broken?

>  		error = mem_cgroup_cache_charge(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +						gfp & GFP_RECLAIM_MASK,
> +						sgp < SGP_WRITE);

The code tests for read-only paths a bunch of times using

	sgp != SGP_WRITE && sgp != SGP_FALLOC

Would probably be more consistent and more robust to use this here as
well?

> @@ -1209,7 +1217,8 @@ repeat:
>  		SetPageSwapBacked(page);
>  		__set_page_locked(page);
>  		error = mem_cgroup_cache_charge(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +						gfp & GFP_RECLAIM_MASK,
> +						sgp < SGP_WRITE);

Same.

Otherwise, the patch looks good to me, thanks for persisting :)

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-28 15:26                                 ` Johannes Weiner
@ 2012-11-28 16:04                                   ` Michal Hocko
  2012-11-28 16:37                                     ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-28 16:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Wed 28-11-12 10:26:31, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote:
> > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >  		return 0;
> >  
> >  	if (!PageSwapCache(page))
> > -		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> > +		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
> >  	else { /* page is swapcache/shmem */
> >  		ret = __mem_cgroup_try_charge_swapin(mm, page,
> >  						     gfp_mask, &memcg);
> 
> I think you need to pass it down the swapcache path too, as that is
> what happens when the shmem page written to is in swap and has been
> read into swapcache by the time of charging.

You are right, of course. I shouldn't send patches late in the evening
after staring at a crashdump for a good part of the day. /me ashamed.

> > @@ -1152,8 +1152,16 @@ repeat:
> >  				goto failed;
> >  		}
> >  
> > +		 /*
> > +                  * Cannot trigger OOM even if gfp_mask would allow that
> > +                  * normally because we might be called from a locked
> > +                  * context (i_mutex held) if this is a write lock or
> > +                  * fallocate and that could lead to deadlocks if the
> > +                  * killed process is waiting for the same lock.
> > +		  */
> 
> Indentation broken?

c&p

> >  		error = mem_cgroup_cache_charge(page, current->mm,
> > -						gfp & GFP_RECLAIM_MASK);
> > +						gfp & GFP_RECLAIM_MASK,
> > +						sgp < SGP_WRITE);
> 
> The code tests for read-only paths a bunch of times using
> 
> 	sgp != SGP_WRITE && sgp != SGP_FALLOC
> 
> Would probably be more consistent and more robust to use this here as
> well?

Yes, my laziness. I was considering that but it was really long so I
chose the simpler way. But you are right that consistency is probably
better here.

> > @@ -1209,7 +1217,8 @@ repeat:
> >  		SetPageSwapBacked(page);
> >  		__set_page_locked(page);
> >  		error = mem_cgroup_cache_charge(page, current->mm,
> > -						gfp & GFP_RECLAIM_MASK);
> > +						gfp & GFP_RECLAIM_MASK,
> > +						sgp < SGP_WRITE);
> 
> Same.
> 
> Otherwise, the patch looks good to me, thanks for persisting :)

Thanks for the thorough review.
Here we go with the fixed version.
---
>From 5000bf32c9c02fcd31d18e615300d8e7e7ef94a5 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 28 Nov 2012 16:49:46 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by the memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- handle also shmem write faults resp. fallocate properly as per Johannes

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |   11 +++++++----
 mm/filemap.c               |    9 +++++++--
 mm/memcontrol.c            |   25 +++++++++++++------------
 mm/memory.c                |    2 +-
 mm/shmem.c                 |   17 ++++++++++++++---
 mm/swapfile.c              |    2 +-
 6 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..5abe441 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
 extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
+		struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
+		bool oom);
 extern void mem_cgroup_commit_charge_swapin(struct page *page,
 					struct mem_cgroup *memcg);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+					gfp_t gfp_mask, bool oom);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,13 +211,15 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
 }
 
 static inline int mem_cgroup_cache_charge(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
+					struct mm_struct *mm, gfp_t gfp_mask,
+					bool oom)
 {
 	return 0;
 }
 
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
+		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp,
+		bool oom)
 {
 	return 0;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
 	if (error)
 		goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..02a6d70 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype)
+				gfp_t gfp_mask, enum charge_type ctype, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
 	VM_BUG_ON(page->mapping && !PageAnon(page));
 	VM_BUG_ON(!mm);
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-					MEM_CGROUP_CHARGE_TYPE_ANON);
+					MEM_CGROUP_CHARGE_TYPE_ANON, true);
 }
 
 /*
@@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
 static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 					  struct page *page,
 					  gfp_t mask,
-					  struct mem_cgroup **memcgp)
+					  struct mem_cgroup **memcgp,
+					  bool oom)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
@@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*memcgp = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
 	css_put(&memcg->css);
 	if (ret == -EINTR)
 		ret = 0;
 	return ret;
 charge_cur_mm:
-	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
+	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
 	if (ret == -EINTR)
 		ret = 0;
 	return ret;
 }
 
 int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
-				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
+				 gfp_t gfp_mask, struct mem_cgroup **memcgp,
+				 bool oom)
 {
 	*memcgp = NULL;
 	if (mem_cgroup_disabled())
@@ -3803,12 +3804,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 	if (!PageSwapCache(page)) {
 		int ret;
 
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom);
 		if (ret == -EINTR)
 			ret = 0;
 		return ret;
 	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
+	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
@@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+				gfp_t gfp_mask, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		return 0;
 
 	if (!PageSwapCache(page))
-		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
 	else { /* page is swapcache/shmem */
 		ret = __mem_cgroup_try_charge_swapin(mm, page,
-						     gfp_mask, &memcg);
+						     gfp_mask, &memcg, oom);
 		if (!ret)
 			__mem_cgroup_commit_charge_swapin(page, memcg, type);
 	}
diff --git a/mm/memory.c b/mm/memory.c
index 6891d3b..afad903 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 
-	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
+	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..3b27db4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,17 @@ repeat:
 				goto failed;
 		}
 
+		 /*
+		  * Cannot trigger OOM even if gfp_mask would allow that
+		  * normally because we might be called from a locked
+		  * context (i_mutex held) if this is a write lock or
+		  * fallocate and that could lead to deadlocks if the
+		  * killed process is waiting for the same lock.
+		  */
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp != SGP_WRITE &&
+						sgp != SGP_FALLOC);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1218,9 @@ repeat:
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp != SGP_WRITE &&
+						sgp != SGP_FALLOC);
 		if (error)
 			goto decused;
 		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2f8e429..8ec511e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	int ret = 1;
 
 	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
-					 GFP_KERNEL, &memcg)) {
+					 GFP_KERNEL, &memcg, true)) {
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-28 16:04                                   ` Michal Hocko
@ 2012-11-28 16:37                                     ` Johannes Weiner
  2012-11-28 16:46                                       ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2012-11-28 16:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..5abe441 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask);
>  /* for swap handling */
>  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> -		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> +		struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> +		bool oom);

Ok, now I feel almost bad for asking, but why the public interface,
too?  You only ever pass "true" in there and this is unlikely to
change anytime soon, no?

> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
>  static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  					  struct page *page,
>  					  gfp_t mask,
> -					  struct mem_cgroup **memcgp)
> +					  struct mem_cgroup **memcgp,
> +					  bool oom)
>  {
>  	struct mem_cgroup *memcg;
>  	struct page_cgroup *pc;
> @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  	if (!memcg)
>  		goto charge_cur_mm;
>  	*memcgp = memcg;
> -	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
> +	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
>  	css_put(&memcg->css);
>  	if (ret == -EINTR)
>  		ret = 0;
>  	return ret;
>  charge_cur_mm:
> -	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
> +	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
>  	if (ret == -EINTR)
>  		ret = 0;
>  	return ret;
>  }

Only this one is needed...

> @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
>  }
>  
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask)
> +				gfp_t gfp_mask, bool oom)
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
> @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		return 0;
>  
>  	if (!PageSwapCache(page))
> -		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> +		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
>  	else { /* page is swapcache/shmem */
>  		ret = __mem_cgroup_try_charge_swapin(mm, page,
> -						     gfp_mask, &memcg);
> +						     gfp_mask, &memcg, oom);
>  		if (!ret)
>  			__mem_cgroup_commit_charge_swapin(page, memcg, type);
>  	}

...for this site.

> diff --git a/mm/memory.c b/mm/memory.c
> index 6891d3b..afad903 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		}
>  	}
>  
> -	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
> +	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) {
>  		ret = VM_FAULT_OOM;
>  		goto out_page;
>  	}

Cannot happen for shmem; the fault handler uses vma->vm_ops->fault.

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2f8e429..8ec511e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	int ret = 1;
>  
>  	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
> -					 GFP_KERNEL, &memcg)) {
> +					 GFP_KERNEL, &memcg, true)) {
>  		ret = -ENOMEM;
>  		goto out_nolock;
>  	}

Cannot happen for shmem; it uses shmem_unuse() instead.


* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-28 16:37                                     ` Johannes Weiner
@ 2012-11-28 16:46                                       ` Michal Hocko
  2012-11-28 16:48                                         ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-28 16:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 095d2b4..5abe441 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> >  				gfp_t gfp_mask);
> >  /* for swap handling */
> >  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > -		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > +		struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > +		bool oom);
> 
> Ok, now I feel almost bad for asking, but why the public interface,
> too?

Would it work out if I tell you it was to double-check that your review
quality has not decreased after that many revisions? :P

Incremental update and the full patch in the reply
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5abe441..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,8 +57,7 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
 extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
-		bool oom);
+		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
 extern void mem_cgroup_commit_charge_swapin(struct page *page,
 					struct mem_cgroup *memcg);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
@@ -218,8 +217,7 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 }
 
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp,
-		bool oom)
+		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02a6d70..3c9b1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3789,8 +3789,7 @@ charge_cur_mm:
 }
 
 int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
-				 gfp_t gfp_mask, struct mem_cgroup **memcgp,
-				 bool oom)
+				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
 	*memcgp = NULL;
 	if (mem_cgroup_disabled())
@@ -3804,12 +3803,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 	if (!PageSwapCache(page)) {
 		int ret;
 
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true);
 		if (ret == -EINTR)
 			ret = 0;
 		return ret;
 	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom);
+	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
diff --git a/mm/memory.c b/mm/memory.c
index afad903..6891d3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 
-	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) {
+	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8ec511e..2f8e429 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	int ret = 1;
 
 	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
-					 GFP_KERNEL, &memcg, true)) {
+					 GFP_KERNEL, &memcg)) {
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-28 16:46                                       ` Michal Hocko
@ 2012-11-28 16:48                                         ` Michal Hocko
  2012-11-28 18:44                                           ` Johannes Weiner
  2012-11-28 20:20                                           ` Hugh Dickins
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-28 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Wed 28-11-12 17:46:40, Michal Hocko wrote:
> On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 095d2b4..5abe441 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > >  				gfp_t gfp_mask);
> > >  /* for swap handling */
> > >  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > > -		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > > +		struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > > +		bool oom);
> > 
> > Ok, now I feel almost bad for asking, but why the public interface,
> > too?
> 
> Would it work out if I tell you it was to double-check that your review
> quality has not decreased after that many revisions? :P
> 
> Incremental update and the full patch in the reply
---
From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 28 Nov 2012 17:46:32 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg OOM killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. a truncate).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because an administrator can still
intervene and increase the limit on the group, which allows the writer to
finish the allocation and release the lock.

This patch fixes the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
argument which is pushed down the call chain.

As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected when the limit is set, and it is
preferable to the OOM killer, IMO.

Changes since v1
- do not abuse gfp_flags and rather use oom parameter directly as per
  Johannes
- also handle shmem write faults and fallocate properly, as per Johannes

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |    5 +++--
 mm/filemap.c               |    9 +++++++--
 mm/memcontrol.c            |   20 ++++++++++----------
 mm/shmem.c                 |   17 ++++++++++++++---
 4 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+					gfp_t gfp_mask, bool oom);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
 }
 
 static inline int mem_cgroup_cache_charge(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
+					struct mm_struct *mm, gfp_t gfp_mask,
+					bool oom)
 {
 	return 0;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
 	if (error)
 		goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..3c9b1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype)
+				gfp_t gfp_mask, enum charge_type ctype, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
 	VM_BUG_ON(page->mapping && !PageAnon(page));
 	VM_BUG_ON(!mm);
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-					MEM_CGROUP_CHARGE_TYPE_ANON);
+					MEM_CGROUP_CHARGE_TYPE_ANON, true);
 }
 
 /*
@@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
 static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 					  struct page *page,
 					  gfp_t mask,
-					  struct mem_cgroup **memcgp)
+					  struct mem_cgroup **memcgp,
+					  bool oom)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
@@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*memcgp = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
 	css_put(&memcg->css);
 	if (ret == -EINTR)
 		ret = 0;
 	return ret;
 charge_cur_mm:
-	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
+	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
 	if (ret == -EINTR)
 		ret = 0;
 	return ret;
@@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 			ret = 0;
 		return ret;
 	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
+	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
@@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+				gfp_t gfp_mask, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		return 0;
 
 	if (!PageSwapCache(page))
-		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
 	else { /* page is swapcache/shmem */
 		ret = __mem_cgroup_try_charge_swapin(mm, page,
-						     gfp_mask, &memcg);
+						     gfp_mask, &memcg, oom);
 		if (!ret)
 			__mem_cgroup_commit_charge_swapin(page, memcg, type);
 	}
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..3b27db4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,17 @@ repeat:
 				goto failed;
 		}
 
+		 /*
+		  * Cannot trigger OOM even if gfp_mask would allow that
+		  * normally because we might be called from a locked
+		  * context (i_mutex held) if this is a write lock or
+		  * fallocate and that could lead to deadlocks if the
+		  * killed process is waiting for the same lock.
+		  */
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp != SGP_WRITE &&
+						sgp != SGP_FALLOC);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1218,9 @@ repeat:
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp != SGP_WRITE &&
+						sgp != SGP_FALLOC);
 		if (error)
 			goto decused;
 		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-28 16:48                                         ` Michal Hocko
@ 2012-11-28 18:44                                           ` Johannes Weiner
  2012-11-28 20:20                                           ` Hugh Dickins
  1 sibling, 0 replies; 172+ messages in thread
From: Johannes Weiner @ 2012-11-28 18:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist

On Wed, Nov 28, 2012 at 05:48:24PM +0100, Michal Hocko wrote:
> On Wed 28-11-12 17:46:40, Michal Hocko wrote:
> > On Wed 28-11-12 11:37:36, Johannes Weiner wrote:
> > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote:
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 095d2b4..5abe441 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > > >  				gfp_t gfp_mask);
> > > >  /* for swap handling */
> > > >  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > > > -		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> > > > +		struct page *page, gfp_t mask, struct mem_cgroup **memcgp,
> > > > +		bool oom);
> > > 
> > > Ok, now I feel almost bad for asking, but why the public interface,
> > > too?
> > 
> > Would it work out if I tell you it was to double-check that your review
> > quality has not decreased after that many revisions? :P

Deal.

> From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 28 Nov 2012 17:46:32 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> 
> The memcg OOM killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents another task from
> terminating because that task is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap, or the swap limit is hit as well and all cache
> pages have been reclaimed already) and the process selected by the memcg
> OOM killer is blocked on i_mutex on the same inode (e.g. a truncate).
> 
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This is not a hard deadlock though, because an administrator can still
> intervene and increase the limit on the group, which allows the writer to
> finish the allocation and release the lock.
> 
> This patch fixes the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
> argument which is pushed down the call chain.
> 
> As a possibly visible result, add_to_page_cache_lru might fail more often
> with ENOMEM, but this is to be expected when the limit is set, and it is
> preferable to the OOM killer, IMO.
> 
> Changes since v1
> - do not abuse gfp_flags and rather use oom parameter directly as per
>   Johannes
> - also handle shmem write faults and fallocate properly, as per Johannes
> 
> Reported-by: azurIt <azurit@pobox.sk>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks, Michal!


* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-28 16:48                                         ` Michal Hocko
  2012-11-28 18:44                                           ` Johannes Weiner
@ 2012-11-28 20:20                                           ` Hugh Dickins
  2012-11-29 14:05                                             ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: Hugh Dickins @ 2012-11-28 20:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, azurIt, linux-kernel,
	linux-mm, cgroups mailinglist

On Wed, 28 Nov 2012, Michal Hocko wrote:
> From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 28 Nov 2012 17:46:32 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> 
> The memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents another task from
> terminating because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap, or the swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by the memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncating it).
> 
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This is not a hard deadlock though because administrator can still
> intervene and increase the limit on the group which helps the writer to
> finish the allocation and release the lock.
> 
> This patch heals the problem by forbidding OOM from page cache charges
> (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom
> argument which is pushed down the call chain.
> 
> As a possibly visible result add_to_page_cache_lru might fail more often
> with ENOMEM but this is to be expected if the limit is set and it is
> preferable to the OOM killer IMO.
> 
> Changes since v1
> - do not abuse gfp_flags and rather use oom parameter directly as per
>   Johannes
> - also handle shmem write faults resp. fallocate properly as per Johannes
> 
> Reported-by: azurIt <azurit@pobox.sk>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Sorry, Michal, you've laboured hard on this: but I dislike it so much
that I'm here overcoming my dread of entering an OOM-killer discussion,
and the resultant deluge of unwelcome CCs for eternity afterwards.

I had been relying on Johannes to repeat his "This issue has been
around for a while so frankly I don't think it's urgent enough to
rush things", but it looks like I have to be the one to repeat it.

Your analysis of azurIt's traces may well be correct, and this patch
may indeed ameliorate the situation, and it's fine as something for
azurIt to try and report on and keep in his tree; but I hope that
it does not go upstream and to stable.

Why do I dislike it so much?  I suppose because it's both too general
and too limited at the same time.

Too general in that it changes the behaviour on OOM for a large set
of memcg charges, all those that go through add_to_page_cache_locked(),
when only a subset of those have the i_mutex issue.

If you're going to be that general, why not go further?  Leave the
mem_cgroup_cache_charge() interface as is, make it not-OOM internally,
no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c.  No other
filesystem gets the benefit of those distinctions: isn't it better to
keep it simple?  (And I can see a partial truncation case where shmem
uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour
is a non-issue, since swapoff invites itself to be killed anyway.)

Too limited in that i_mutex is just the held resource which azurIt's
traces have led you to, but it's a general problem that the OOM-killed
task might be waiting for a resource that the OOM-killing task holds.

I suspect that if we try hard enough (I admit I have not), we can find
an example of such a potential deadlock for almost every memcg charge
site.  mmap_sem? not as easy to invent a case with that as I thought,
since it needs a down_write, and the typical page allocations happen
with down_read, and I can't think of a process which does down_write
on another's mm.

But i_mutex is always good, once you remember the case of write to
file from userspace page which got paged out, so the fault path has
to read it back in, while i_mutex is still held at the outer level.
An unusual case?  Well, normally yes, but we're considering
out-of-memory conditions, which may converge upon cases like this.

Wouldn't it be nice if I could be constructive?  But I'm sceptical
even of Johannes's faith in what the global OOM killer would do:
how does __alloc_pages_slowpath() get out of its "goto restart"
loop, excepting the trivial case when the killer is the killed?

I wonder why this issue has hit azurIt and no other reporter?
No swap plays a part in it, but that's not so unusual.

Yours glOOMily,
Hugh

> ---
>  include/linux/memcontrol.h |    5 +++--
>  mm/filemap.c               |    9 +++++++--
>  mm/memcontrol.c            |   20 ++++++++++----------
>  mm/shmem.c                 |   17 ++++++++++++++---
>  4 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 095d2b4..8f48d5e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>  extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>  
>  extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +					gfp_t gfp_mask, bool oom);
>  
>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
>  }
>  
>  static inline int mem_cgroup_cache_charge(struct page *page,
> -					struct mm_struct *mm, gfp_t gfp_mask)
> +					struct mm_struct *mm, gfp_t gfp_mask,
> +					bool oom)
>  {
>  	return 0;
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..ef8fbd5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(PageSwapBacked(page));
>  
> -	error = mem_cgroup_cache_charge(page, current->mm,
> -					gfp_mask & GFP_RECLAIM_MASK);
> +	/*
> +	 * Cannot trigger OOM even if gfp_mask would allow that normally
> +	 * because we might be called from a locked context and that
> +	 * could lead to deadlocks if the killed process is waiting for
> +	 * the same lock.
> +	 */
> +	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
>  	if (error)
>  		goto out;
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 02ee2f7..3c9b1c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3709,11 +3709,10 @@ out:
>   * < 0 if the cgroup is over its limit
>   */
>  static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask, enum charge_type ctype)
> +				gfp_t gfp_mask, enum charge_type ctype, bool oom)
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	unsigned int nr_pages = 1;
> -	bool oom = true;
>  	int ret;
>  
>  	if (PageTransHuge(page)) {
> @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
>  	VM_BUG_ON(page->mapping && !PageAnon(page));
>  	VM_BUG_ON(!mm);
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> -					MEM_CGROUP_CHARGE_TYPE_ANON);
> +					MEM_CGROUP_CHARGE_TYPE_ANON, true);
>  }
>  
>  /*
> @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
>  static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  					  struct page *page,
>  					  gfp_t mask,
> -					  struct mem_cgroup **memcgp)
> +					  struct mem_cgroup **memcgp,
> +					  bool oom)
>  {
>  	struct mem_cgroup *memcg;
>  	struct page_cgroup *pc;
> @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  	if (!memcg)
>  		goto charge_cur_mm;
>  	*memcgp = memcg;
> -	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
> +	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
>  	css_put(&memcg->css);
>  	if (ret == -EINTR)
>  		ret = 0;
>  	return ret;
>  charge_cur_mm:
> -	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
> +	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
>  	if (ret == -EINTR)
>  		ret = 0;
>  	return ret;
> @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
>  			ret = 0;
>  		return ret;
>  	}
> -	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
> +	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
>  }
>  
>  void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
> @@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
>  }
>  
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask)
> +				gfp_t gfp_mask, bool oom)
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
> @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		return 0;
>  
>  	if (!PageSwapCache(page))
> -		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> +		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
>  	else { /* page is swapcache/shmem */
>  		ret = __mem_cgroup_try_charge_swapin(mm, page,
> -						     gfp_mask, &memcg);
> +						     gfp_mask, &memcg, oom);
>  		if (!ret)
>  			__mem_cgroup_commit_charge_swapin(page, memcg, type);
>  	}
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 55054a7..3b27db4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
>  	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
>  	 * Charged back to the user (not to caller) when swap account is used.
>  	 */
> -	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> +	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
>  	if (error)
>  		goto out;
>  	/* No radix_tree_preload: swap entry keeps a place for page in tree */
> @@ -1152,8 +1152,17 @@ repeat:
>  				goto failed;
>  		}
>  
> +		 /*
> +		  * Cannot trigger OOM even if gfp_mask would allow that
> +		  * normally because we might be called from a locked
> +		  * context (i_mutex held) if this is a write lock or
> +		  * fallocate and that could lead to deadlocks if the
> +		  * killed process is waiting for the same lock.
> +		  */
>  		error = mem_cgroup_cache_charge(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +						gfp & GFP_RECLAIM_MASK,
> +						sgp != SGP_WRITE &&
> +						sgp != SGP_FALLOC);
>  		if (!error) {
>  			error = shmem_add_to_page_cache(page, mapping, index,
>  						gfp, swp_to_radix_entry(swap));
> @@ -1209,7 +1218,9 @@ repeat:
>  		SetPageSwapBacked(page);
>  		__set_page_locked(page);
>  		error = mem_cgroup_cache_charge(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +						gfp & GFP_RECLAIM_MASK,
> +						sgp != SGP_WRITE &&
> +						sgp != SGP_FALLOC);
>  		if (error)
>  			goto decused;
>  		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
> -- 
> 1.7.10.4
> 
> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-28 20:20                                           ` Hugh Dickins
@ 2012-11-29 14:05                                             ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-29 14:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, azurIt, linux-kernel,
	linux-mm, cgroups mailinglist

On Wed 28-11-12 12:20:44, Hugh Dickins wrote:
[...]
> Sorry, Michal, you've laboured hard on this: but I dislike it so much
> that I'm here overcoming my dread of entering an OOM-killer discussion,
> and the resultant deluge of unwelcome CCs for eternity afterwards.
> 
> I had been relying on Johannes to repeat his "This issue has been
> around for a while so frankly I don't think it's urgent enough to
> rush things", but it looks like I have to be the one to repeat it.

Well, the idea was to use this only as a temporary fix and come up with a
better solution without any hurry.

> Your analysis of azurIt's traces may well be correct, and this patch
> may indeed ameliorate the situation, and it's fine as something for
> azurIt to try and report on and keep in his tree; but I hope that
> it does not go upstream and to stable.
> 
> Why do I dislike it so much?  I suppose because it's both too general
> and too limited at the same time.
> 
> Too general in that it changes the behaviour on OOM for a large set
> of memcg charges, all those that go through add_to_page_cache_locked(),
> when only a subset of those have the i_mutex issue.

This is a fair point but the real fix which Johannes and I were discussing
would be even riskier for stable.

> If you're going to be that general, why not go further?  Leave the
> mem_cgroup_cache_charge() interface as is, make it not-OOM internally,
> no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c.  No other
> filesystem gets the benefit of those distinctions: isn't it better to
> keep it simple?  (And I can see a partial truncation case where shmem
> uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour
> is a non-issue, since swapoff invites itself to be killed anyway.)
> 
> Too limited in that i_mutex is just the held resource which azurIt's
> traces have led you to, but it's a general problem that the OOM-killed
> task might be waiting for a resource that the OOM-killing task holds.
> 
> I suspect that if we try hard enough (I admit I have not), we can find
> an example of such a potential deadlock for almost every memcg charge
> site.  mmap_sem? not as easy to invent a case with that as I thought,
> since it needs a down_write, and the typical page allocations happen
> with down_read, and I can't think of a process which does down_write
> on another's mm.
> 
> But i_mutex is always good, once you remember the case of write to
> file from userspace page which got paged out, so the fault path has
> to read it back in, while i_mutex is still held at the outer level.
> An unusual case?  Well, normally yes, but we're considering
> out-of-memory conditions, which may converge upon cases like this.
> 
> Wouldn't it be nice if I could be constructive?  But I'm sceptical
> even of Johannes's faith in what the global OOM killer would do:
> how does __alloc_pages_slowpath() get out of its "goto restart"
> loop, excepting the trivial case when the killer is the killed?

I am not sure I am following you here, but Johannes's idea was to break
out of the charge once a signal has been sent and the charge still
fails, and then either retry the fault or fail the allocation. I think
this should work but I am afraid it needs some tuning (e.g. the number
of retries) to prevent too aggressive OOMs or too many failures.
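
That bail-out idea can be sketched as a small bounded-retry loop. This is only an illustrative Python sketch of the control flow being discussed, not kernel code; the helper names (`charge_once`, `fatal_signal_pending`) and the retry count are invented:

```python
def try_charge_with_retries(charge_once, fatal_signal_pending, max_retries=5):
    """Bounded-retry charge loop in the spirit of the idea above.

    charge_once() stands in for a single memcg charge attempt and
    fatal_signal_pending() for the check whether this task has already
    been OOM-killed; both names are invented for this sketch.
    """
    for _ in range(max_retries):
        if charge_once():
            return "charged"
        if fatal_signal_pending():
            # Bail out so the caller can unwind its locks and either
            # retry the fault or fail the allocation.
            return "bail-out"
    return "failed"  # give up with -ENOMEM after too many attempts

print(try_charge_with_retries(lambda: False, lambda: True))
```

The tuning question mentioned above is exactly the `max_retries` bound: too low means many spurious allocation failures, too high means tasks linger before the OOM kill can resolve anything.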

Do we have any other possibilities for solving this issue? Or do you think
we should ignore the problem just because nobody has complained for such a
long time?
Dunno, I think we should fix this with something less risky for now and
come up with a real fix after it has seen sufficient testing.

> I wonder why this issue has hit azurIt and no other reporter?
> No swap plays a part in it, but that's not so unusual.
> 
> Yours glOOMily,
> Hugh

[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 13:21                         ` [PATCH for 3.2.34] " Michal Hocko
  2012-11-26 21:28                           ` azurIt
@ 2012-11-30  1:45                           ` azurIt
  2012-11-30  2:29                           ` azurIt
  2 siblings, 0 replies; 172+ messages in thread
From: azurIt @ 2012-11-30  1:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Here we go with the patch for 3.2.34. Could you test with this one,
>please?


I installed the kernel with this patch and will report back if the problem occurs again, OR in a few weeks if everything is OK. Thank you!

azurIt


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26 13:21                         ` [PATCH for 3.2.34] " Michal Hocko
  2012-11-26 21:28                           ` azurIt
  2012-11-30  1:45                           ` azurIt
@ 2012-11-30  2:29                           ` azurIt
  2012-11-30 12:45                             ` Michal Hocko
  2 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-30  2:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Here we go with the patch for 3.2.34. Could you test with this one,
>please?


Michal, unfortunately i had to boot to another kernel because the one with this patch kept killing my MySQL server :( it was probably doing it on OOM in any cgroup - it looks like OOM was not choosing processes only from the cgroup which is out of memory. Here is the log from syslog: http://www.watchdog.sk/lkml/oom_mysqld

Maybe i should mention that the MySQL server has its own cgroup (called 'mysql') but with no limits on any resources.

azurIt


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30  2:29                           ` azurIt
@ 2012-11-30 12:45                             ` Michal Hocko
  2012-11-30 12:53                               ` azurIt
  2012-11-30 13:44                               ` azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-30 12:45 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 03:29:18, azurIt wrote:
> >Here we go with the patch for 3.2.34. Could you test with this one,
> >please?
> 
> 
> Michal, unfortunately i had to boot to another kernel because the one
> with this patch keeps killing my MySQL server :( it was, probably,
> doing it on OOM in any cgroup - looks like OOM was not choosing
> processes only from cgroup which is out of memory. Here is the log
> from syslog: http://www.watchdog.sk/lkml/oom_mysqld

You are also seeing a global OOM:
Nov 30 02:53:56 server01 kernel: [  818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [  818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [  818.233470]  [<ffffffff810cc90e>] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.233600]  [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [  818.233721]  [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [  818.233842]  [<ffffffff810cd485>] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [  818.233963]  [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [  818.234082]  [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [  818.234204]  [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [  818.235886]  [<ffffffff8102739e>] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [  818.236006]  [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [  818.236124]  [<ffffffff810f35d7>] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [  818.236244]  [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.236367]  [<ffffffff815b547f>] page_fault+0x1f/0x30
[...]
Nov 30 02:53:56 server01 kernel: [  818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child
Nov 30 02:53:56 server01 kernel: [  818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB

Then you also have memcg oom killer:
Nov 30 02:53:56 server01 kernel: [  818.375717] Task in /1037/uid killed as a result of limit of /1037
Nov 30 02:53:56 server01 kernel: [  818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736
Nov 30 02:53:56 server01 kernel: [  818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0

The messages are intermixed and I guess rate limiting jumped in as
well, because I cannot associate all the oom messages with a specific OOM
event.
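
Untangling such intermixed logs can be done mechanically by keying on the two kill messages quoted above. A hypothetical Python sketch; the sample lines are shortened copies of the log excerpts in this mail:

```python
def classify_oom_lines(log_lines):
    """Separate global-OOM kills from memcg-OOM kills in a kernel log.

    Keys on the two message formats quoted in this thread:
      global OOM:  "Out of memory: Kill process ..."
      memcg OOM:   "Task in <cgroup> killed as a result of limit of <cgroup>"
    """
    global_oom = [l for l in log_lines if "Out of memory: Kill process" in l]
    memcg_oom = [l for l in log_lines if "killed as a result of limit of" in l]
    return global_oom, memcg_oom

sample = [
    "Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child",
    "Task in /1037/uid killed as a result of limit of /1037",
    "memory: usage 102400kB, limit 102400kB, failcnt 736",
]
g, m = classify_oom_lines(sample)
print(len(g), len(m))  # -> 1 1
```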

Anyway, your system is under both global and local memory pressure. You
didn't see apache going down previously because it was probably the one
which was stuck and could be killed.
You need to set up your system more carefully.

> Maybe i should mention that MySQL server has it's own cgroup (called
> 'mysql') but with no limits to any resources.

Where is that group in the hierarchy?
> 
> azurIt
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 12:45                             ` Michal Hocko
@ 2012-11-30 12:53                               ` azurIt
  2012-11-30 13:44                               ` azurIt
  1 sibling, 0 replies; 172+ messages in thread
From: azurIt @ 2012-11-30 12:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to setup your system more carefully.


No, it wasn't, i'm 1000% sure (i was on SSH). Here is the memory usage graph from that system at that time:
http://www.watchdog.sk/lkml/memory.png

The blank part is rebooting into the new kernel. The MySQL server was killed several times, then i rebooted into the previous kernel and the problem was gone (not a single MySQL kill). You can see two MySQL kills there, at 03:54 and 03:04:30.


>
>> Maybe i should mention that MySQL server has it's own cgroup (called
>> 'mysql') but with no limits to any resources.
>
>Where is that group in the hierarchy?



In root.


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 12:45                             ` Michal Hocko
  2012-11-30 12:53                               ` azurIt
@ 2012-11-30 13:44                               ` azurIt
  2012-11-30 14:44                                 ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-30 13:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to setup your system more carefully.


There is also evidence that the system has enough memory! :) Just take the 'rss' column from the process list in the OOM message and sum it - you will get 2489911. It's probably in KB so it's about 2.4 GB. The system has 14 GB of RAM, so this also matches the data on my graph - 2.4 is about 17% of 14.
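
The summation described here can be automated; a rough Python sketch assuming the 3.2-era task-dump layout (`[ pid ]  uid  tgid total_vm  rss cpu oom_adj oom_score_adj name`), with made-up sample lines rather than the real log:

```python
def sum_rss_column(task_dump_lines):
    """Sum the 'rss' column of an OOM-killer task dump (3.2-era layout):

        [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name

    After stripping the brackets the fields are whitespace-separated and
    'rss' is the fifth one (index 4).
    """
    total = 0
    for line in task_dump_lines:
        fields = line.replace("[", " ").replace("]", " ").split()
        if len(fields) >= 9 and fields[0].isdigit():
            total += int(fields[4])
    return total

# Made-up sample lines in the same shape (not copied from the real log):
sample_dump = [
    "[ 2188]   103  2188   832504   218062   3       0             0 mysqld",
    "[ 9247]    33  9247    51234    10000   1       0             0 apache2",
]
print(sum_rss_column(sample_dump))  # -> 228062
```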

azur


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 13:44                               ` azurIt
@ 2012-11-30 14:44                                 ` Michal Hocko
  2012-11-30 15:03                                   ` Michal Hocko
  2012-11-30 15:08                                   ` azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-30 14:44 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 14:44:27, azurIt wrote:
> >Anyway your system is under both global and local memory pressure. You
> >didn't see apache going down previously because it was probably the one
> >which was stuck and could be killed.
> >Anyway you need to setup your system more carefully.
> 
> 
> There is, also, an evidence that system has enough of memory! :) Just
> take column 'rss' from process list in OOM message and sum it - you
> will get 2489911. It's probably in KB so it's about 2.4 GB. System has
> 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of
> 14.

Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
is hardly touched:
Nov 30 02:53:56 server01 kernel: [  818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
of the memory above 4G, or you have a NUMA machine and the rest of the
memory is at another node. Could you post your memory map printed during
boot? (e820: BIOS-provided physical RAM map: and the following lines)

There is also ZONE_NORMAL, which is not used much either:
Nov 30 02:53:56 server01 kernel: [  818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
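
The per-zone figures quoted above can be pulled out of such a report line with a small helper; a sketch assuming the `key:<value>kB` layout of the show_mem() output quoted here (the function name is invented):

```python
import re

def parse_zone_line(line):
    """Pull the kB-valued fields out of a show_mem() zone report line."""
    return {key: int(val) for key, val in re.findall(r"(\w+):(\d+)kB", line)}

zone = parse_zone_line(
    "DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB")
# 'free' being far above the 'high' watermark means this zone is not
# short of memory at all, which is what makes the global OOM surprising.
print(zone["free"] > zone["high"])  # -> True
```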

You have mentioned that you are comounting with cpuset. If this happens
to be a NUMA machine, have you made all the nodes accessible?
Also, what does /proc/sys/vm/zone_reclaim_mode say?
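
For completeness, the value read back from /proc/sys/vm/zone_reclaim_mode is a bitmask (bit meanings per Documentation/sysctl/vm.txt); a small sketch for decoding it:

```python
def decode_zone_reclaim_mode(value):
    """Decode the /proc/sys/vm/zone_reclaim_mode bitmask
    (bit meanings per Documentation/sysctl/vm.txt)."""
    flags = []
    if value & 1:
        flags.append("zone reclaim on")
    if value & 2:
        flags.append("zone reclaim writes dirty pages out")
    if value & 4:
        flags.append("zone reclaim swaps pages")
    return flags or ["zone reclaim off"]

# The live value would be read with:
#   int(open("/proc/sys/vm/zone_reclaim_mode").read())
print(decode_zone_reclaim_mode(0))  # -> ['zone reclaim off']
```
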
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 14:44                                 ` Michal Hocko
@ 2012-11-30 15:03                                   ` Michal Hocko
  2012-11-30 15:37                                     ` Michal Hocko
  2012-11-30 15:08                                   ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-30 15:03 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 15:44:31, Michal Hocko wrote:
> On Fri 30-11-12 14:44:27, azurIt wrote:
> > >Anyway your system is under both global and local memory pressure. You
> > >didn't see apache going down previously because it was probably the one
> > >which was stuck and could be killed.
> > >Anyway you need to setup your system more carefully.
> > 
> > 
> > There is, also, an evidence that system has enough of memory! :) Just
> > take column 'rss' from process list in OOM message and sum it - you
> > will get 2489911. It's probably in KB so it's about 2.4 GB. System has
> > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of
> > 14.
> 
> Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
> is hardly touched:
> Nov 30 02:53:56 server01 kernel: [  818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> 
> DMA32 zone is usually fills up first 4G unless your HW remaps the rest
> of the memory above 4G or you have a numa machine and the rest of the
> memory is at other node. Could you post your memory map printed during
> the boot? (e820: BIOS-provided physical RAM map: and following lines)
> 
> There is also ZONE_NORMAL which is also not used much
> Nov 30 02:53:56 server01 kernel: [  818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> 
> You have mentioned that you are comounting with cpuset. If this happens
> to be a NUMA machine have you made the access to all nodes available?

And now that I am looking at the oom message more closely I can see
Nov 30 02:53:56 server01 kernel: [  818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Nov 30 02:53:56 server01 kernel: [  818.233029] apache2 cpuset=uid mems_allowed=0
Nov 30 02:53:56 server01 kernel: [  818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [  818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [  818.233470]  [<ffffffff810cc90e>] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.233600]  [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [  818.233721]  [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [  818.233842]  [<ffffffff810cd485>] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [  818.233963]  [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [  818.234082]  [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [  818.234204]  [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [  818.235886]  [<ffffffff8102739e>] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [  818.236006]  [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [  818.236124]  [<ffffffff810f35d7>] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [  818.236244]  [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.236367]  [<ffffffff815b547f>] page_fault+0x1f/0x30

Which is interesting from two perspectives. Only the first node (Node-0)
is allowed, which would suggest that the cpuset controller is not
configured to allow all nodes. It is still surprising that Node 0 wouldn't
have any memory (I would expect ZONE_DMA32 to be sitting there).

Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation
from the page fault? Huh this shouldn't happen - ever.
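
As a side note, gfp_mask=0x0 really does mean that no gfp flags at all were set, not even __GFP_WAIT, which is why the report reads like an atomic allocation. A tiny decoder for such oom-killer lines (a sketch; the bit values are copied from 3.2's include/linux/gfp.h and are an assumption for other kernel versions):

```python
# Low gfp_mask bits as defined in 3.2's include/linux/gfp.h (assumed values).
GFP_BITS = {
    0x01: "__GFP_DMA", 0x02: "__GFP_HIGHMEM", 0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE", 0x10: "__GFP_WAIT", 0x20: "__GFP_HIGH",
    0x40: "__GFP_IO", 0x80: "__GFP_FS",
}

def decode_gfp(mask):
    """Names of the flags set in a gfp_mask from an oom-killer report."""
    return [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]

print(decode_gfp(0x0))   # the mask from the report above: no flags at all
print(decode_gfp(0xd0))  # GFP_KERNEL == __GFP_WAIT | __GFP_IO | __GFP_FS
```

An empty mask therefore carries no information about the original allocation site, which is a hint that the report was not produced by a failing allocation itself.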

> Also what does /proc/sys/vm/zone_reclaim_mode says?
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 14:44                                 ` Michal Hocko
  2012-11-30 15:03                                   ` Michal Hocko
@ 2012-11-30 15:08                                   ` azurIt
  2012-11-30 15:39                                     ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-30 15:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
>of the memory above 4G or you have a numa machine and the rest of the
>memory is at other node. Could you post your memory map printed during
>the boot? (e820: BIOS-provided physical RAM map: and following lines)


Here is the full boot log:
www.watchdog.sk/lkml/kern.log


>You have mentioned that you are comounting with cpuset. If this happens
>to be a NUMA machine have you made the access to all nodes available?
>Also what does /proc/sys/vm/zone_reclaim_mode says?


Don't really know what NUMA means or which nodes you are talking about, sorry :(

# cat /proc/sys/vm/zone_reclaim_mode
cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory
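
The missing file is itself an answer here: /proc/sys/vm/zone_reclaim_mode only exists on kernels built with CONFIG_NUMA. A minimal sketch of that check (the procfs/sysfs paths are the standard ones; the helper name is made up):

```python
import os

def numa_build_check(procfs="/proc", sysfs="/sys"):
    """Report whether the running kernel was built with CONFIG_NUMA.

    A sketch: both paths below are only created by NUMA-enabled kernels.
    """
    if not os.path.exists(os.path.join(procfs, "sys/vm/zone_reclaim_mode")):
        return "kernel built without CONFIG_NUMA"
    node_dir = os.path.join(sysfs, "devices/system/node")
    nodes = []
    if os.path.isdir(node_dir):
        nodes = sorted(d for d in os.listdir(node_dir) if d.startswith("node"))
    return "NUMA-enabled kernel, nodes: " + (" ".join(nodes) or "?")

print(numa_build_check())
```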



azur

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 15:03                                   ` Michal Hocko
@ 2012-11-30 15:37                                     ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-30 15:37 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 16:03:47, Michal Hocko wrote:
[...]
> Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation
> from the page fault? Huh this shouldn't happen - ever.

OK, it starts making sense now. The message came from
pagefault_out_of_memory, which no longer has the gfp or the required node
information. This suggests that VM_FAULT_OOM has been
returned by the fault handler, so this hasn't been triggered by the page
fault allocator.
I am wondering whether this could be caused by the patch, but the effect
of that one should be limited to the write path (unlike the later version
for the -mm tree, which hooks into shmem as well).

Will have to think about it some more.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 15:08                                   ` azurIt
@ 2012-11-30 15:39                                     ` Michal Hocko
  2012-11-30 15:59                                       ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-30 15:39 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 16:08:11, azurIt wrote:
> >The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
> >of the memory above 4G or you have a numa machine and the rest of the
> >memory is at other node. Could you post your memory map printed during
> >the boot? (e820: BIOS-provided physical RAM map: and following lines)
> 
> 
> Here is the full boot log:
> www.watchdog.sk/lkml/kern.log

The log is not complete. Could you paste the complete dmesg output? Or
even better, do you have logs from the previous run?

> >You have mentioned that you are comounting with cpuset. If this happens
> >to be a NUMA machine have you made the access to all nodes available?
> >Also what does /proc/sys/vm/zone_reclaim_mode says?
> 
> 
> Don't really know what NUMA means or which nodes you are talking
> about, sorry :(

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
 
> # cat /proc/sys/vm/zone_reclaim_mode
> cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory

OK, so the NUMA is not enabled.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 15:39                                     ` Michal Hocko
@ 2012-11-30 15:59                                       ` azurIt
  2012-11-30 16:19                                         ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-30 15:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>> Here is the full boot log:
>> www.watchdog.sk/lkml/kern.log
>
>The log is not complete. Could you paste the complete dmesg output? Or
>even better, do you have logs from the previous run?


What is missing there? All kernel messages are logged to /var/log/kern.log (it's the same as dmesg); dmesg itself was already overwritten by other messages. I think that's everything the kernel printed.

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 15:59                                       ` azurIt
@ 2012-11-30 16:19                                         ` Michal Hocko
  2012-11-30 16:26                                           ` azurIt
  2012-12-03 15:16                                           ` Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-11-30 16:19 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 16:59:37, azurIt wrote:
> >> Here is the full boot log:
> >> www.watchdog.sk/lkml/kern.log
> >
> >The log is not complete. Could you paste the complete dmesg output? Or
> >even better, do you have logs from the previous run?
> 
> 
> What is missing there? All kernel messages are logged to
> /var/log/kern.log (it's the same as dmesg); dmesg itself was already
> overwritten by other messages. I think that's everything the kernel printed.

Early boot messages are missing - so exactly the BIOS memory map I was
asking for. As the NUMA has been excluded it is probably not that
relevant anymore.
The important question is why you see VM_FAULT_OOM and whether a memcg
charging failure can trigger that. I do not see how this could happen
right now because __GFP_NORETRY is not used for user pages (except for
THP, which disables memcg OOM already), and file backed page faults (aka
__do_fault) use mem_cgroup_newpage_charge, which doesn't disable OOM.
This is a real head scratcher.

Could you also post your complete containers configuration, maybe there
is something strange in there (basically grep . -r YOUR_CGROUP_MNT
except for tasks files which are of no use right now).
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 16:19                                         ` Michal Hocko
@ 2012-11-30 16:26                                           ` azurIt
  2012-11-30 16:53                                             ` Michal Hocko
  2012-12-03 15:16                                           ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-11-30 16:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Could you also post your complete containers configuration, maybe there
>is something strange in there (basically grep . -r YOUR_CGROUP_MNT
>except for tasks files which are of no use right now).


Here it is:
http://www.watchdog.sk/lkml/cgroups.gz

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 16:26                                           ` azurIt
@ 2012-11-30 16:53                                             ` Michal Hocko
  2012-11-30 20:43                                               ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-11-30 16:53 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 17:26:51, azurIt wrote:
> >Could you also post your complete containers configuration, maybe there
> >is something strange in there (basically grep . -r YOUR_CGROUP_MNT
> >except for tasks files which are of no use right now).
> 
> 
> Here it is:
> http://www.watchdog.sk/lkml/cgroups.gz

The only strange thing I noticed is that some groups have 0 limit. Is
this intentional?
grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
      3 memory.limit_in_bytes:0
    254 memory.limit_in_bytes:104857600
    107 memory.limit_in_bytes:157286400
     68 memory.limit_in_bytes:209715200
     10 memory.limit_in_bytes:262144000
     28 memory.limit_in_bytes:314572800
      1 memory.limit_in_bytes:346030080
      1 memory.limit_in_bytes:524288000
      2 memory.limit_in_bytes:9223372036854775807
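
For what it is worth, those counts sum to a sizable aggregate hard limit (a quick sketch; the pairs are copied from the grep output above and the two RESOURCE_MAX groups are left out):

```python
# (count, memory.limit_in_bytes) pairs from the histogram above; the two
# groups at RESOURCE_MAX (9223372036854775807) are effectively unlimited
# and therefore excluded from the sum.
limits = [
    (3, 0), (254, 104857600), (107, 157286400), (68, 209715200),
    (10, 262144000), (28, 314572800), (1, 346030080), (1, 524288000),
]
total = sum(count * limit for count, limit in limits)
print(total, "bytes, i.e. about", round(total / 2**30, 1), "GiB")
```

That is far above the physical memory visible in the zone report earlier, which is perfectly normal for per-user limits but worth keeping in mind when a global OOM report shows up next to healthy-looking memcg limits.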
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 16:53                                             ` Michal Hocko
@ 2012-11-30 20:43                                               ` azurIt
  0 siblings, 0 replies; 172+ messages in thread
From: azurIt @ 2012-11-30 20:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>The only strange thing I noticed is that some groups have 0 limit. Is
>this intentional?
>grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
>      3 memory.limit_in_bytes:0


These are users who are not allowed to run anything.


azur

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-30 16:19                                         ` Michal Hocko
  2012-11-30 16:26                                           ` azurIt
@ 2012-12-03 15:16                                           ` Michal Hocko
  2012-12-05  1:36                                             ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-12-03 15:16 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 30-11-12 17:19:23, Michal Hocko wrote:
[...]
> The important question is why you see VM_FAULT_OOM and whether a memcg
> charging failure can trigger that. I do not see how this could happen
> right now because __GFP_NORETRY is not used for user pages (except for
> THP, which disables memcg OOM already), and file backed page faults (aka
> __do_fault) use mem_cgroup_newpage_charge, which doesn't disable OOM.
> This is a real head scratcher.

The following should print the traces when we hand over ENOMEM to the
caller. It should catch all charge paths (migration is not covered but
that one is not important here). If we don't see any traces from here
and there is still global OOM striking then there must be something else
to trigger this.
Could you test this with the patch which aims at fixing your deadlock,
please? I realise that this is a production environment but I do not see
anything relevant in the code.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..9e5b56b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	__WARN();
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-03 15:16                                           ` Michal Hocko
@ 2012-12-05  1:36                                             ` azurIt
  2012-12-05 14:17                                               ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-05  1:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>The following should print the traces when we hand over ENOMEM to the
>caller. It should catch all charge paths (migration is not covered but
>that one is not important here). If we don't see any traces from here
>and there is still global OOM striking then there must be something else
>to trigger this.
>Could you test this with the patch which aims at fixing your deadlock,
>please? I realise that this is a production environment but I do not see
>anything relevant in the code.


Michal,

I think/hope this is what you wanted:
http://www.watchdog.sk/lkml/oom_mysqld2

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-05  1:36                                             ` azurIt
@ 2012-12-05 14:17                                               ` Michal Hocko
  2012-12-06  0:29                                                 ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-12-05 14:17 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Wed 05-12-12 02:36:44, azurIt wrote:
> >The following should print the traces when we hand over ENOMEM to the
> >caller. It should catch all charge paths (migration is not covered but
> >that one is not important here). If we don't see any traces from here
> >and there is still global OOM striking then there must be something else
> >to trigger this.
> >Could you test this with the patch which aims at fixing your deadlock,
> >please? I realise that this is a production environment but I do not see
> >anything relevant in the code.
> 
> 
> Michal,
> 
> I think/hope this is what you wanted:
> http://www.watchdog.sk/lkml/oom_mysqld2

Dec  5 02:20:48 server01 kernel: [  380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0()
Dec  5 02:20:48 server01 kernel: [  380.995950] Hardware name: S5000VSA
Dec  5 02:20:48 server01 kernel: [  380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1
Dec  5 02:20:48 server01 kernel: [  380.995954] Call Trace:
Dec  5 02:20:48 server01 kernel: [  380.995960]  [<ffffffff81054eaa>] warn_slowpath_common+0x7a/0xb0
Dec  5 02:20:48 server01 kernel: [  380.995963]  [<ffffffff81054efa>] warn_slowpath_null+0x1a/0x20
Dec  5 02:20:48 server01 kernel: [  380.995965]  [<ffffffff8110b2e1>] T.1146+0x2c1/0x5d0
Dec  5 02:20:48 server01 kernel: [  380.995967]  [<ffffffff8110ba83>] mem_cgroup_charge_common+0x53/0x90
Dec  5 02:20:48 server01 kernel: [  380.995970]  [<ffffffff8110bb05>] mem_cgroup_newpage_charge+0x45/0x50
Dec  5 02:20:48 server01 kernel: [  380.995974]  [<ffffffff810eddf9>] handle_pte_fault+0x609/0x940
Dec  5 02:20:48 server01 kernel: [  380.995978]  [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Dec  5 02:20:48 server01 kernel: [  380.995981]  [<ffffffff810ee268>] handle_mm_fault+0x138/0x260
Dec  5 02:20:48 server01 kernel: [  380.995983]  [<ffffffff810270ed>] do_page_fault+0x13d/0x460
Dec  5 02:20:48 server01 kernel: [  380.995986]  [<ffffffff810f429c>] ? do_mmap_pgoff+0x3dc/0x430
Dec  5 02:20:48 server01 kernel: [  380.995988]  [<ffffffff810f197d>] ? remove_vma+0x5d/0x80
Dec  5 02:20:48 server01 kernel: [  380.995992]  [<ffffffff815b54ff>] page_fault+0x1f/0x30
Dec  5 02:20:48 server01 kernel: [  380.995994] ---[ end trace 25bbb3e634c25b7f ]---
Dec  5 02:20:48 server01 kernel: [  380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec  5 02:20:48 server01 kernel: [  380.996377] apache2 cpuset=uid mems_allowed=0
Dec  5 02:20:48 server01 kernel: [  380.996379] Pid: 5351, comm: apache2 Tainted: G        W    3.2.34-grsec #1
Dec  5 02:20:48 server01 kernel: [  380.996380] Call Trace:
Dec  5 02:20:48 server01 kernel: [  380.996384]  [<ffffffff810cc91e>] dump_header+0x7e/0x1e0
Dec  5 02:20:48 server01 kernel: [  380.996387]  [<ffffffff810cc81f>] ? find_lock_task_mm+0x2f/0x70
Dec  5 02:20:48 server01 kernel: [  380.996389]  [<ffffffff810ccde5>] oom_kill_process+0x85/0x2a0
Dec  5 02:20:48 server01 kernel: [  380.996392]  [<ffffffff810cd495>] out_of_memory+0xe5/0x200
Dec  5 02:20:48 server01 kernel: [  380.996394]  [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50
Dec  5 02:20:48 server01 kernel: [  380.996397]  [<ffffffff810cd66d>] pagefault_out_of_memory+0xbd/0x110
Dec  5 02:20:48 server01 kernel: [  380.996399]  [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0
Dec  5 02:20:48 server01 kernel: [  380.996401]  [<ffffffff8102739e>] do_page_fault+0x3ee/0x460
Dec  5 02:20:48 server01 kernel: [  380.996403]  [<ffffffff810f429c>] ? do_mmap_pgoff+0x3dc/0x430
Dec  5 02:20:48 server01 kernel: [  380.996405]  [<ffffffff810f197d>] ? remove_vma+0x5d/0x80
Dec  5 02:20:48 server01 kernel: [  380.996408]  [<ffffffff815b54ff>] page_fault+0x1f/0x30

OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
This can only happen if this was an atomic allocation request
(!__GFP_WAIT) or if oom is not allowed which is the case only for
transparent huge page allocation.
The first case can be excluded (in the clean 3.2 stable kernel) because
all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
should be OK because the page fault should fall back to a regular page if
THP allocation/charge fails.
[/me goes to double check]
Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with
VM_FAULT_OOM without any fallback. We should use do_huge_pmd_wp_page_fallback
instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
patch applies to 3.2 without any further modifications. I didn't have
time to test it but if it helps you we should push this to the stable
tree.
---
>From 765f5e0121c4410faa19c088e9ada75976bde178 Mon Sep 17 00:00:00 2001
From: David Rientjes <rientjes@google.com>
Date: Tue, 29 May 2012 15:06:23 -0700
Subject: [PATCH] thp, memcg: split hugepage for memcg oom on cow

On COW, a new hugepage is allocated and charged to the memcg.  If the
system is oom or the charge to the memcg fails, however, the fault
handler will return VM_FAULT_OOM which results in an oom kill.

Instead, it's possible to fallback to splitting the hugepage so that the
COW results only in an order-0 page being allocated and charged to the
memcg which has a higher liklihood to succeed.  This is expensive
because the hugepage must be split in the page fault handler, but it is
much better than unnecessarily oom killing a process.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f1d06c34f7675026326cd9f39ff91e4555cf355)
---
 mm/huge_memory.c |    3 +++
 mm/memory.c      |   18 +++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f005e9..470cbb4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -921,6 +921,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 						   pmd, orig_pmd, page, haddr);
+		if (ret & VM_FAULT_OOM)
+			split_huge_page(page);
 		put_page(page);
 		goto out;
 	}
@@ -928,6 +930,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		put_page(new_page);
+		split_huge_page(page);
 		put_page(page);
 		ret |= VM_FAULT_OOM;
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
index 70f5daf..15e686a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3469,6 +3469,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
+retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3482,13 +3483,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 							  pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
+		int ret;
+
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd))
-				return do_huge_pmd_wp_page(mm, vma, address,
-							   pmd, orig_pmd);
+			    !pmd_trans_splitting(orig_pmd)) {
+				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
+							  orig_pmd);
+				/*
+				 * If COW results in an oom, the huge pmd will
+				 * have been split, so retry the fault on the
+				 * pte for a smaller charge.
+				 */
+				if (unlikely(ret & VM_FAULT_OOM))
+					goto retry;
+				return ret;
+			}
 			return 0;
 		}
 	}
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-05 14:17                                               ` Michal Hocko
@ 2012-12-06  0:29                                                 ` azurIt
  2012-12-06  9:54                                                   ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-06  0:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
>This can only happen if this was an atomic allocation request
>(!__GFP_WAIT) or if oom is not allowed which is the case only for
>transparent huge page allocation.
>The first case can be excluded (in the clean 3.2 stable kernel) because
>all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
>should be OK because the page fault should fall back to a regular page if
>THP allocation/charge fails.
>[/me goes to double check]
>Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with
>VM_FAULT_OOM without any fallback. We should use do_huge_pmd_wp_page_fallback
>instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
>hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
>patch applies to 3.2 without any further modifications. I didn't have
>time to test it but if it helps you we should push this to the stable
>tree.


This, unfortunately, didn't fix the problem :(
http://www.watchdog.sk/lkml/oom_mysqld3

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-06  0:29                                                 ` azurIt
@ 2012-12-06  9:54                                                   ` Michal Hocko
  2012-12-06 10:12                                                     ` azurIt
  2012-12-10  1:20                                                     ` azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-12-06  9:54 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Thu 06-12-12 01:29:24, azurIt wrote:
> >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
> >This can only happen if this was an atomic allocation request
> >(!__GFP_WAIT) or if oom is not allowed which is the case only for
> >transparent huge page allocation.
> >The first case can be excluded (in the clean 3.2 stable kernel) because
> >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
> >should be OK because the page fault should fall back to a regular page if
> >THP allocation/charge fails.
> >[/me goes to double check]
> >Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with
> >VM_FAULT_OOM without any fallback. We should use do_huge_pmd_wp_page_fallback
> >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
> >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
> >patch applies to 3.2 without any further modifications. I didn't have
> >time to test it but if it helps you we should push this to the stable
> >tree.
> 
> 
> This, unfortunately, didn't fix the problem :(
> http://www.watchdog.sk/lkml/oom_mysqld3

Dohh. The very same stack mem_cgroup_newpage_charge called from the page
fault. The heavy inlining is not particularly helping here... So there
must be some other THP charge leaking out.
[/me is diving into the code again]

* do_huge_pmd_anonymous_page falls back to handle_pte_fault
* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
  charge the huge page
* do_huge_pmd_wp_page splits the huge page and retries with fallback to
  handle_pte_fault
* collapse_huge_page is not called in the page fault path
* do_wp_page, do_anonymous_page and __do_fault  operate on a single page
  so the memcg charging cannot return ENOMEM

There are no other callers AFAICS so I am getting clueless. Maybe more
debugging will tell us something (the inlining has been reduced for thp
paths which can reduce performance in thp page fault heavy workloads but
this will give us better traces - I hope).

Anyway do you see the same problem if transparent huge pages are
disabled?
echo never > /sys/kernel/mm/transparent_hugepage/enabled
---
>From 93a30140b50d8474a047b91c698f4880149635db Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 6 Dec 2012 10:40:17 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |    6 +++---
 mm/memcontrol.c  |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
 					pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
 	goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e5b56b..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,7 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
-	__WARN();
+	__WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-06  9:54                                                   ` Michal Hocko
@ 2012-12-06 10:12                                                     ` azurIt
  2012-12-06 17:06                                                       ` Michal Hocko
  2012-12-10  1:20                                                     ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-06 10:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Dohh. The very same stack mem_cgroup_newpage_charge called from the page
>fault. The heavy inlining is not particularly helping here... So there
>must be some other THP charge leaking out.
>[/me is diving into the code again]
>
>* do_huge_pmd_anonymous_page falls back to handle_pte_fault
>* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
>  charge the huge page
>* do_huge_pmd_wp_page splits the huge page and retries with fallback to
>  handle_pte_fault
>* collapse_huge_page is not called in the page fault path
>* do_wp_page, do_anonymous_page and __do_fault  operate on a single page
>  so the memcg charging cannot return ENOMEM
>
>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).


Should I apply all patches together? (fix for this bug, more log messages, backported fix from 3.5 and this new one)


>Anyway do you see the same problem if transparent huge pages are
>disabled?
>echo never > /sys/kernel/mm/transparent_hugepage/enabled


# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-06 10:12                                                     ` azurIt
@ 2012-12-06 17:06                                                       ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-12-06 17:06 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Thu 06-12-12 11:12:49, azurIt wrote:
> >Dohh. The very same stack mem_cgroup_newpage_charge called from the page
> >fault. The heavy inlining is not particularly helping here... So there
> >must be some other THP charge leaking out.
> >[/me is diving into the code again]
> >
> >* do_huge_pmd_anonymous_page falls back to handle_pte_fault
> >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
> >  charge the huge page
> >* do_huge_pmd_wp_page splits the huge page and retries with fallback to
> >  handle_pte_fault
> >* collapse_huge_page is not called in the page fault path
> >* do_wp_page, do_anonymous_page and __do_fault  operate on a single page
> >  so the memcg charging cannot return ENOMEM
> >
> >There are no other callers AFAICS so I am getting clueless. Maybe more
> >debugging will tell us something (the inlining has been reduced for thp
> >paths which can reduce performance in thp page fault heavy workloads but
> >this will give us better traces - I hope).
> 
> 
> Should i apply all patches together? (the fix for this bug, more log
> messages, the backported fix from 3.5 and this new one)

Yes please
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-06  9:54                                                   ` Michal Hocko
  2012-12-06 10:12                                                     ` azurIt
@ 2012-12-10  1:20                                                     ` azurIt
  2012-12-10  9:43                                                       ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-10  1:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).


Michal,

this was printing so many debug messages to the console that the whole server hung and i had to hard reset it after several minutes :( Sorry but i cannot test such things in production. There's no problem with one soft reset which takes 4 minutes but this hard reset creates about a 20 minute outage (mainly because of disk quota checking). Last logged message:

Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds.  Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-10  1:20                                                     ` azurIt
@ 2012-12-10  9:43                                                       ` Michal Hocko
  2012-12-10 10:18                                                         ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-12-10  9:43 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Mon 10-12-12 02:20:38, azurIt wrote:
[...]
> Michal,

Hi,
 
> this was printing so many debug messages to the console that the whole
> server hung

Hmm, this is _really_ surprising. The latest patch didn't add any new
logging actually. It just enhanced messages which were already printed
out previously + changed a few functions to not be inlined so they show
up in the traces. So the only explanation is that the workload has
changed or the patches got misapplied.

> and i had to hard reset it after several minutes :( Sorry
> but i cannot test such things in production. There's no problem with
> one soft reset which takes 4 minutes but this hard reset creates about
> a 20 minute outage (mainly because of disk quota checking).

Understood.

> Last logged message:
> 
> Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds.  Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0

This explains why you have seen your machine hung. I am not familiar
with grsec but stalling each fork 30s sounds really bad.

Anyway this will not help me much. Do you happen to still have any of
those logged traces from the last run?

Apart from that. If my current understanding is correct then this is
related to transparent huge pages (and leaking charge to the page fault
handler). Do you see the same problem if you disable THP before you
start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-10  9:43                                                       ` Michal Hocko
@ 2012-12-10 10:18                                                         ` azurIt
  2012-12-10 15:52                                                           ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-10 10:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Hmm, this is _really_ surprising. The latest patch didn't add any new
>logging actually. It just enhanced messages which were already printed
>out previously + changed a few functions to not be inlined so they show up
>in the traces. So the only explanation is that the workload has changed
>or the patches got misapplied.


This time i installed 3.2.35, maybe some change between .34 and .35 caused this? Should i try .34?


>> Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds.  Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0
>
>This explains why you have seen your machine hung. I am not familiar
>with grsec but stalling each fork 30s sounds really bad.


Btw, i have never seen such a message from grsecurity before. Will write to the grsec mailing list for an explanation.


>Anyway this will not help me much. Do you happen to still have any of
>those logged traces from the last run?


Unfortunately not, it didn't log anything and tons of messages were printed only to the console (i was logged in via IP-KVM). It looked like the printing was infinite, so i rebooted after a few minutes.


>Apart from that. If my current understanding is correct then this is
>related to transparent huge pages (and leaking charge to the page fault
>handler). Do you see the same problem if you disable THP before you
>start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

# ls -la /sys/kernel/mm                             
total 0
drwx------ 3 root root 0 Dec 10 11:11 .
drwx------ 5 root root 0 Dec 10 02:06 ..
drwx------ 2 root root 0 Dec 10 11:11 cleancache

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-10 10:18                                                         ` azurIt
@ 2012-12-10 15:52                                                           ` Michal Hocko
  2012-12-10 17:18                                                             ` azurIt
  2012-12-17  1:34                                                             ` azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-12-10 15:52 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Mon 10-12-12 11:18:17, azurIt wrote:
> >Hmm, this is _really_ surprising. The latest patch didn't add any new
> >logging actually. It just enhanced messages which were already printed
> >out previously + changed a few functions to not be inlined so they show up
> >in the traces. So the only explanation is that the workload has changed
> >or the patches got misapplied.
> 
> 
> This time i installed 3.2.35, maybe some change between .34 and .35
> caused this? Should i try .34?

I would try to limit changes to minimum. So the original kernel you were
using + the first patch to prevent OOM from the write path + 2 debugging
patches.
 
> >> Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds.  Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0
> >
> >This explains why you have seen your machine hung. I am not familiar
> >with grsec but stalling each fork 30s sounds really bad.
> 
> 
> Btw, i have never seen such a message from grsecurity before. Will write to the grsec mailing list for an explanation.
> 
> 
> >Anyway this will not help me much. Do you happen to still have any of
> >those logged traces from the last run?
> 
> 
> Unfortunately not, it didn't log anything and tons of messages were
> printed only to the console (i was logged in via IP-KVM). It looked
> like the printing was infinite, so i rebooted after a few minutes.

But was it at least related to the debugging from the patch, or was it
rather a totally unrelated thing?

> >Apart from that. If my current understanding is correct then this is
> >related to transparent huge pages (and leaking charge to the page fault
> >handler). Do you see the same problem if you disable THP before you
> >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

Weee. Then it cannot be related to THP at all. Which makes this an even
bigger mystery.
We really need to find out who is leaking that charge.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-10 15:52                                                           ` Michal Hocko
@ 2012-12-10 17:18                                                             ` azurIt
  2012-12-17  1:34                                                             ` azurIt
  1 sibling, 0 replies; 172+ messages in thread
From: azurIt @ 2012-12-10 17:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>I would try to limit changes to minimum. So the original kernel you were
>using + the first patch to prevent OOM from the write path + 2 debugging
>patches.


ok.


>But was it at least related to the debugging from the patch, or was it
>rather a totally unrelated thing?


I wasn't reading it much but i think it looks like the traces i was sending you before.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-10 15:52                                                           ` Michal Hocko
  2012-12-10 17:18                                                             ` azurIt
@ 2012-12-17  1:34                                                             ` azurIt
  2012-12-17 16:32                                                               ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-17  1:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>I would try to limit changes to minimum. So the original kernel you were
>using + the first patch to prevent OOM from the write path + 2 debugging
>patches.


It didn't take down the whole system this time (but i was prepared to record a video of the console ;) ), here it is:
http://www.watchdog.sk/lkml/oom_mysqld4

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-17  1:34                                                             ` azurIt
@ 2012-12-17 16:32                                                               ` Michal Hocko
  2012-12-17 18:23                                                                 ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-12-17 16:32 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Mon 17-12-12 02:34:30, azurIt wrote:
> >I would try to limit changes to minimum. So the original kernel you were
> >using + the first patch to prevent OOM from the write path + 2 debugging
> >patches.
> 
> 
> It didn't take down the whole system this time (but i was
> prepared to record a video of the console ;) ), here it is:
> http://www.watchdog.sk/lkml/oom_mysqld4

[...]
[ 1248.059429] ------------[ cut here ]------------
[ 1248.059586] WARNING: at mm/memcontrol.c:2400 T.1146+0x2d9/0x610()
[ 1248.059723] Hardware name: S5000VSA
[ 1248.059855] gfp_mask:208 nr_pages:1 oom:0 ret:2

This is GFP_KERNEL allocation which is expected. It is also a simple
page which is not that expected because we shouldn't return ENOMEM on
those unless this was GFP_ATOMIC allocation (which it wasn't) or the
caller told us to not trigger OOM which is the case only for THP pages
(see mem_cgroup_charge_common). So the big question is how we have ended
up with oom=false here...

[Ohh, I am really an idiot. I screwed the first patch]
-       bool oom = true;
+       bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);

Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM).
  No idea how I could have missed that. I am really sorry about that.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c04676d..1f35a74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-17 16:32                                                               ` Michal Hocko
@ 2012-12-17 18:23                                                                 ` azurIt
  2012-12-17 19:55                                                                   ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-17 18:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>[Ohh, I am really an idiot. I screwed the first patch]
>-       bool oom = true;
>+       bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
>
>Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM).
>  No idea how I could have missed that. I am really sorry about that.


:D no problem :) so, now it should really work as expected and completely fix my original problem? Is it safe to apply it on 3.2.35? Thank you very much!

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-17 18:23                                                                 ` azurIt
@ 2012-12-17 19:55                                                                   ` Michal Hocko
  2012-12-18 14:22                                                                     ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-12-17 19:55 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Mon 17-12-12 19:23:01, azurIt wrote:
> >[Ohh, I am really an idiot. I screwed the first patch]
> >-       bool oom = true;
> >+       bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
> >
> >Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM).
> >  No idea how I could have missed that. I am really sorry about that.
> 
> 
> :D no problem :) so, now it should really work as expected and
> completely fix my original problem?

It should mitigate the problem. The real fix shouldn't be that specific
(as per discussion in other thread). The chance this will get upstream
is not big and that means that it will not get to the stable tree
either.

> is it safe to apply it on 3.2.35?

I didn't check what the differences are but I do not think there is
anything to conflict with it.

> Thank you very much!

HTH

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-17 19:55                                                                   ` Michal Hocko
@ 2012-12-18 14:22                                                                     ` azurIt
  2012-12-18 15:20                                                                       ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-18 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>It should mitigate the problem. The real fix shouldn't be that specific
>(as per discussion in other thread). The chance this will get upstream
>is not big and that means that it will not get to the stable tree
>either.


OOM is no longer killing processes outside the target cgroups, so everything looks fine so far. Will report back when i have more info. Thanks!

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-18 14:22                                                                     ` azurIt
@ 2012-12-18 15:20                                                                       ` Michal Hocko
  2012-12-24 13:25                                                                         ` azurIt
  2012-12-24 13:38                                                                         ` azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2012-12-18 15:20 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Tue 18-12-12 15:22:23, azurIt wrote:
> >It should mitigate the problem. The real fix shouldn't be that specific
> >(as per discussion in other thread). The chance this will get upstream
> >is not big and that means that it will not get to the stable tree
> >either.
> 
> 
> OOM is no longer killing processes outside the target cgroups, so
> everything looks fine so far. Will report back when i have more
> info. Thanks!

OK, good to hear and fingers crossed. I will try to get back to the
original problem and a better solution sometimes early next year when
all the things settle a bit.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-18 15:20                                                                       ` Michal Hocko
@ 2012-12-24 13:25                                                                         ` azurIt
  2012-12-28 16:22                                                                           ` Michal Hocko
  2012-12-24 13:38                                                                         ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-24 13:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>OK, good to hear and fingers crossed. I will try to get back to the
>original problem and a better solution sometimes early next year when
>all the things settle a bit.


Michal, the problem unfortunately happened again :( twice. When it happened the first time (two days ago) i didn't want to believe it, so i recompiled the kernel and booted it again to be sure i had really used your patch. Today it happened again, here is the report:
http://watchdog.sk/lkml/memcg-bug-3.tar.gz

Here is the patch which i used (kernel 3.2.35, i didn't use any of your other patches):
http://watchdog.sk/lkml/5-memcg-fix.patch

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-18 15:20                                                                       ` Michal Hocko
  2012-12-24 13:25                                                                         ` azurIt
@ 2012-12-24 13:38                                                                         ` azurIt
  2012-12-28 16:35                                                                           ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-24 13:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>OK, good to hear and fingers crossed. I will try to get back to the
>original problem and a better solution sometimes early next year when
>all the things settle a bit.


Btw, i noticed one more thing when the problem is happening (=when any cgroup is stuck), i forgot to mention it before, sorry :( . It's related to the HDDs, something is slowing them down in a strange way. All services are working normally and i really cannot notice any slowness; the only thing i noticed is affected is our backup software ( www.Bacula.org ). When the problem occurs at night, so while the backup is running, the backup is extremely slow and usually doesn't finish until i kill the processes inside the affected cgroup (=until i resolve the problem). The backup software is NOT using big HDD bandwidth BUT it does quite a huge number of disk operations (it needs to stat every file and directory). I believe that only the speed of disk operations is affected and is very slow.

Merry christmas!

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-24 13:25                                                                         ` azurIt
@ 2012-12-28 16:22                                                                           ` Michal Hocko
  2012-12-30  1:09                                                                             ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-12-28 16:22 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Mon 24-12-12 14:25:26, azurIt wrote:
> >OK, good to hear and fingers crossed. I will try to get back to the
> >original problem and a better solution sometimes early next year when
> >all the things settle a bit.
> 
> 
> Michal, the problem unfortunately happened again :( twice. When it
> happened the first time (two days ago) i didn't want to believe it, so i
> recompiled the kernel and booted it again to be sure i had really used
> your patch. Today it happened again, here is the report:
> http://watchdog.sk/lkml/memcg-bug-3.tar.gz

Hmm, 1356352982/1507/stack says
[<ffffffff8110a971>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b55b>] T.1147+0x5ab/0x5c0
[<ffffffff8110c1de>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca20f>] add_to_page_cache_locked+0x4f/0x140
[<ffffffff810ca322>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810cac53>] find_or_create_page+0x73/0xb0
[<ffffffff8114340a>] __getblk+0xea/0x2c0
[<ffffffff811921ab>] ext3_getblk+0xeb/0x240
[<ffffffff81192319>] ext3_bread+0x19/0x90
[<ffffffff811967e3>] ext3_dx_find_entry+0x83/0x1e0
[<ffffffff81196c24>] ext3_find_entry+0x2e4/0x480
[<ffffffff8119750d>] ext3_lookup+0x4d/0x120
[<ffffffff8111cff5>] d_alloc_and_lookup+0x45/0x90
[<ffffffff8111d598>] do_lookup+0x278/0x390
[<ffffffff8111f11e>] path_lookupat+0xae/0x7e0
[<ffffffff8111f885>] do_path_lookup+0x35/0xe0
[<ffffffff8111fa19>] user_path_at_empty+0x59/0xb0
[<ffffffff8111fa81>] user_path_at+0x11/0x20
[<ffffffff811164d7>] vfs_fstatat+0x47/0x80
[<ffffffff8111657e>] vfs_lstat+0x1e/0x20
[<ffffffff811165a4>] sys_newlstat+0x24/0x50
[<ffffffff815b5a66>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

which suggests that the patch is incomplete and that I am blind :/
mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
follow-up patch on top of the one you already have (which should catch
all the remaining cases).
Sorry about that...
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89997ac..559a54d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-24 13:38                                                                         ` azurIt
@ 2012-12-28 16:35                                                                           ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2012-12-28 16:35 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Mon 24-12-12 14:38:50, azurIt wrote:
> >OK, good to hear and fingers crossed. I will try to get back to the
> >original problem and a better solution sometimes early next year when
> >all the things settle a bit.
> 
> 
> Btw, i noticed one more thing when the problem is happening (=when any
> cgroup is stuck), i forgot to mention it before, sorry :( . It's
> related to the HDDs, something is slowing them down in a strange way. All
> services are working normally and i really cannot notice any slowness;
> the only thing i noticed is affected is our backup software (
> www.Bacula.org ). When the problem occurs at night, so while the
> backup is running, the backup is extremely slow and usually doesn't
> finish until i kill the processes inside the affected cgroup (=until i
> resolve the problem). The backup software is NOT using big HDD
> bandwidth BUT it does quite a huge number of disk operations (it needs
> to stat every file and directory). I believe that only the speed of
> disk operations is affected and is very slow.

I would bet that this is caused by the blocked processes in the memcg oom
handler which hold i_mutex while the backup process wants to access the
same inode with an operation which requires the lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-28 16:22                                                                           ` Michal Hocko
@ 2012-12-30  1:09                                                                             ` azurIt
  2012-12-30 11:08                                                                               ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2012-12-30  1:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>which suggests that the patch is incomplete and that I am blind :/
>mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
>and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
>follow-up patch on top of the one you already have (which should catch
>all the remaining cases).
>Sorry about that...


This was, again, killing my MySQL server (search for "(mysqld)"):
http://www.watchdog.sk/lkml/oom_mysqld5

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-30  1:09                                                                             ` azurIt
@ 2012-12-30 11:08                                                                               ` Michal Hocko
  2013-01-25 15:07                                                                                 ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2012-12-30 11:08 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Sun 30-12-12 02:09:47, azurIt wrote:
> >which suggests that the patch is incomplete and that I am blind :/
> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
> >follow-up patch on top of the one you already have (which should catch
> >all the remaining cases).
> >Sorry about that...
> 
> 
> This was, again, killing my MySQL server (search for "(mysqld)"):
> http://www.watchdog.sk/lkml/oom_mysqld5

grep "Kill process" oom_mysqld5 
Dec 30 01:53:34 server01 kernel: [  367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
Dec 30 01:53:35 server01 kernel: [  367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
Dec 30 01:53:35 server01 kernel: [  367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child

So your mysqld has been killed by the global OOM killer, not memcg. But why,
when you seem to be perfectly fine regarding memory? I guess the following
backtrace is relevant:
Dec 30 01:53:36 server01 kernel: [  368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
Dec 30 01:53:36 server01 kernel: [  368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
Dec 30 01:53:36 server01 kernel: [  368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB
Dec 30 01:53:36 server01 kernel: [  368.571906] 308964 total pagecache pages
Dec 30 01:53:36 server01 kernel: [  368.572023] 0 pages in swap cache
Dec 30 01:53:36 server01 kernel: [  368.572140] Swap cache stats: add 0, delete 0, find 0/0
Dec 30 01:53:36 server01 kernel: [  368.572260] Free swap  = 0kB
Dec 30 01:53:36 server01 kernel: [  368.572375] Total swap = 0kB
Dec 30 01:53:36 server01 kernel: [  368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec 30 01:53:36 server01 kernel: [  368.598034] apache2 cpuset=uid mems_allowed=0
Dec 30 01:53:36 server01 kernel: [  368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1
Dec 30 01:53:36 server01 kernel: [  368.598273] Call Trace:
Dec 30 01:53:36 server01 kernel: [  368.598396]  [<ffffffff810cc89e>] dump_header+0x7e/0x1e0
Dec 30 01:53:36 server01 kernel: [  368.598516]  [<ffffffff810cc79f>] ? find_lock_task_mm+0x2f/0x70
Dec 30 01:53:36 server01 kernel: [  368.598638]  [<ffffffff810ccd65>] oom_kill_process+0x85/0x2a0
Dec 30 01:53:36 server01 kernel: [  368.598759]  [<ffffffff810cd415>] out_of_memory+0xe5/0x200
Dec 30 01:53:36 server01 kernel: [  368.598880]  [<ffffffff810cd5ed>] pagefault_out_of_memory+0xbd/0x110
Dec 30 01:53:36 server01 kernel: [  368.599006]  [<ffffffff81026e96>] mm_fault_error+0xb6/0x1a0
Dec 30 01:53:36 server01 kernel: [  368.599127]  [<ffffffff8102736e>] do_page_fault+0x3ee/0x460
Dec 30 01:53:36 server01 kernel: [  368.599250]  [<ffffffff81131ccf>] ? mntput+0x1f/0x30
Dec 30 01:53:36 server01 kernel: [  368.599371]  [<ffffffff811134e6>] ? fput+0x156/0x200
Dec 30 01:53:36 server01 kernel: [  368.599496]  [<ffffffff815b567f>] page_fault+0x1f/0x30

This would suggest that an unexpected ENOMEM leaked during the page
fault path. I do not see which one that could be because you said THP
(CONFIG_TRANSPARENT_HUGEPAGE) is disabled (and the other patch I
mentioned in the thread should fix that issue - btw. the patch is
already scheduled for the stable tree).
 __do_fault, do_anonymous_page and do_wp_page call
mem_cgroup_newpage_charge with GFP_KERNEL which means that
we do memcg OOM and never return ENOMEM. do_swap_page calls
mem_cgroup_try_charge_swapin with GFP_KERNEL as well.

I might have missed something but I will not get to look closer before
2nd January.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-12-30 11:08                                                                               ` Michal Hocko
@ 2013-01-25 15:07                                                                                 ` azurIt
  2013-01-25 16:31                                                                                   ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-01-25 15:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Any news? Thnx!

azur



______________________________________________________________
> Od: "Michal Hocko" <mhocko@suse.cz>
> Komu: azurIt <azurit@pobox.sk>
> Dátum: 30.12.2012 12:08
> Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>
> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org>
>On Sun 30-12-12 02:09:47, azurIt wrote:
>> >which suggests that the patch is incomplete and that I am blind :/
>> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
>> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
>> >follow-up patch on top of the one you already have (which should catch
>> >all the remaining cases).
>> >Sorry about that...
>> 
>> 
>> This was, again, killing my MySQL server (search for "(mysqld)"):
>> http://www.watchdog.sk/lkml/oom_mysqld5
>
>grep "Kill process" oom_mysqld5 
>Dec 30 01:53:34 server01 kernel: [  367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
>Dec 30 01:53:35 server01 kernel: [  367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
>Dec 30 01:53:35 server01 kernel: [  367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [  369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [  369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [  369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child
>
>So your mysqld has been killed by the global OOM not memcg. But why when
>you seem to be perfectly fine regarding memory? I guess the following
>backtrace is relevant:
>Dec 30 01:53:36 server01 kernel: [  368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
>Dec 30 01:53:36 server01 kernel: [  368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
>Dec 30 01:53:36 server01 kernel: [  368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB
>Dec 30 01:53:36 server01 kernel: [  368.571906] 308964 total pagecache pages
>Dec 30 01:53:36 server01 kernel: [  368.572023] 0 pages in swap cache
>Dec 30 01:53:36 server01 kernel: [  368.572140] Swap cache stats: add 0, delete 0, find 0/0
>Dec 30 01:53:36 server01 kernel: [  368.572260] Free swap  = 0kB
>Dec 30 01:53:36 server01 kernel: [  368.572375] Total swap = 0kB
>Dec 30 01:53:36 server01 kernel: [  368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>Dec 30 01:53:36 server01 kernel: [  368.598034] apache2 cpuset=uid mems_allowed=0
>Dec 30 01:53:36 server01 kernel: [  368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1
>Dec 30 01:53:36 server01 kernel: [  368.598273] Call Trace:
>Dec 30 01:53:36 server01 kernel: [  368.598396]  [<ffffffff810cc89e>] dump_header+0x7e/0x1e0
>Dec 30 01:53:36 server01 kernel: [  368.598516]  [<ffffffff810cc79f>] ? find_lock_task_mm+0x2f/0x70
>Dec 30 01:53:36 server01 kernel: [  368.598638]  [<ffffffff810ccd65>] oom_kill_process+0x85/0x2a0
>Dec 30 01:53:36 server01 kernel: [  368.598759]  [<ffffffff810cd415>] out_of_memory+0xe5/0x200
>Dec 30 01:53:36 server01 kernel: [  368.598880]  [<ffffffff810cd5ed>] pagefault_out_of_memory+0xbd/0x110
>Dec 30 01:53:36 server01 kernel: [  368.599006]  [<ffffffff81026e96>] mm_fault_error+0xb6/0x1a0
>Dec 30 01:53:36 server01 kernel: [  368.599127]  [<ffffffff8102736e>] do_page_fault+0x3ee/0x460
>Dec 30 01:53:36 server01 kernel: [  368.599250]  [<ffffffff81131ccf>] ? mntput+0x1f/0x30
>Dec 30 01:53:36 server01 kernel: [  368.599371]  [<ffffffff811134e6>] ? fput+0x156/0x200
>Dec 30 01:53:36 server01 kernel: [  368.599496]  [<ffffffff815b567f>] page_fault+0x1f/0x30
>
>This would suggest that an unexpected ENOMEM leaked during page fault
>path. I do not see which one could that be because you said THP
>(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have
>mentioned in the thread should fix that issue - btw. the patch is
>already scheduled for stable tree).
> __do_fault, do_anonymous_page and do_wp_page call
>mem_cgroup_newpage_charge with GFP_KERNEL which means that
>we do memcg OOM and never return ENOMEM. do_swap_page calls
>mem_cgroup_try_charge_swapin with GFP_KERNEL as well.
>
>I might have missed something but I will not get to look closer before
>2nd January.
>-- 
>Michal Hocko
>SUSE Labs
>

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-01-25 15:07                                                                                 ` azurIt
@ 2013-01-25 16:31                                                                                   ` Michal Hocko
  2013-02-05 13:49                                                                                     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-01-25 16:31 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 25-01-13 16:07:23, azurIt wrote:
> Any news? Thnx!

Sorry, but I didn't get to this one yet.

> 
> azur
> 
> 
> 
> ______________________________________________________________
> > Od: "Michal Hocko" <mhocko@suse.cz>
> > Komu: azurIt <azurit@pobox.sk>
> > Dátum: 30.12.2012 12:08
> > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
> >
> > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org>
> >On Sun 30-12-12 02:09:47, azurIt wrote:
> >> >which suggests that the patch is incomplete and that I am blind :/
> >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
> >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
> >> >follow-up patch on top of the one you already have (which should catch
> >> >all the remaining cases).
> >> >Sorry about that...
> >> 
> >> 
> >> This was, again, killing my MySQL server (search for "(mysqld)"):
> >> http://www.watchdog.sk/lkml/oom_mysqld5
> >
> >grep "Kill process" oom_mysqld5 
> >Dec 30 01:53:34 server01 kernel: [  367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
> >Dec 30 01:53:35 server01 kernel: [  367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
> >Dec 30 01:53:35 server01 kernel: [  367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [  369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [  369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [  369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child
> >
> >So your mysqld has been killed by the global OOM not memcg. But why when
> >you seem to be perfectly fine regarding memory? I guess the following
> >backtrace is relevant:
> >Dec 30 01:53:36 server01 kernel: [  368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
> >Dec 30 01:53:36 server01 kernel: [  368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
> >Dec 30 01:53:36 server01 kernel: [  368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB
> >Dec 30 01:53:36 server01 kernel: [  368.571906] 308964 total pagecache pages
> >Dec 30 01:53:36 server01 kernel: [  368.572023] 0 pages in swap cache
> >Dec 30 01:53:36 server01 kernel: [  368.572140] Swap cache stats: add 0, delete 0, find 0/0
> >Dec 30 01:53:36 server01 kernel: [  368.572260] Free swap  = 0kB
> >Dec 30 01:53:36 server01 kernel: [  368.572375] Total swap = 0kB
> >Dec 30 01:53:36 server01 kernel: [  368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> >Dec 30 01:53:36 server01 kernel: [  368.598034] apache2 cpuset=uid mems_allowed=0
> >Dec 30 01:53:36 server01 kernel: [  368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1
> >Dec 30 01:53:36 server01 kernel: [  368.598273] Call Trace:
> >Dec 30 01:53:36 server01 kernel: [  368.598396]  [<ffffffff810cc89e>] dump_header+0x7e/0x1e0
> >Dec 30 01:53:36 server01 kernel: [  368.598516]  [<ffffffff810cc79f>] ? find_lock_task_mm+0x2f/0x70
> >Dec 30 01:53:36 server01 kernel: [  368.598638]  [<ffffffff810ccd65>] oom_kill_process+0x85/0x2a0
> >Dec 30 01:53:36 server01 kernel: [  368.598759]  [<ffffffff810cd415>] out_of_memory+0xe5/0x200
> >Dec 30 01:53:36 server01 kernel: [  368.598880]  [<ffffffff810cd5ed>] pagefault_out_of_memory+0xbd/0x110
> >Dec 30 01:53:36 server01 kernel: [  368.599006]  [<ffffffff81026e96>] mm_fault_error+0xb6/0x1a0
> >Dec 30 01:53:36 server01 kernel: [  368.599127]  [<ffffffff8102736e>] do_page_fault+0x3ee/0x460
> >Dec 30 01:53:36 server01 kernel: [  368.599250]  [<ffffffff81131ccf>] ? mntput+0x1f/0x30
> >Dec 30 01:53:36 server01 kernel: [  368.599371]  [<ffffffff811134e6>] ? fput+0x156/0x200
> >Dec 30 01:53:36 server01 kernel: [  368.599496]  [<ffffffff815b567f>] page_fault+0x1f/0x30
> >
> >This would suggest that an unexpected ENOMEM leaked during page fault
> >path. I do not see which one could that be because you said THP
> >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have
> >mentioned in the thread should fix that issue - btw. the patch is
> >already scheduled for stable tree).
> > __do_fault, do_anonymous_page and do_wp_page call
> >mem_cgroup_newpage_charge with GFP_KERNEL which means that
> >we do memcg OOM and never return ENOMEM. do_swap_page calls
> >mem_cgroup_try_charge_swapin with GFP_KERNEL as well.
> >
> >I might have missed something but I will not get to look closer before
> >2nd January.
> >-- 
> >Michal Hocko
> >SUSE Labs
> >
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-01-25 16:31                                                                                   ` Michal Hocko
@ 2013-02-05 13:49                                                                                     ` Michal Hocko
  2013-02-05 14:49                                                                                       ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-05 13:49 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 25-01-13 17:31:30, Michal Hocko wrote:
> On Fri 25-01-13 16:07:23, azurIt wrote:
> > Any news? Thnx!
> 
> Sorry, but I didn't get to this one yet.

Sorry to get back to this so late, but I have been busy as hell since
the beginning of the year.

Has the issue repeated since then?

You said you didn't apply anything other than the above-mentioned patch.
Could you also apply the debugging part of the patches I have sent?
In case you don't have it handy, it should be this one:
---
>From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 3 Dec 2012 16:16:01 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |    6 +++---
 mm/memcontrol.c  |    1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
 					pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
 	goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	__WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 13:49                                                                                     ` Michal Hocko
@ 2013-02-05 14:49                                                                                       ` azurIt
  2013-02-05 16:09                                                                                         ` Michal Hocko
  2013-02-05 16:31                                                                                         ` Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: azurIt @ 2013-02-05 14:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Sorry to get back to this so late, but I have been busy as hell since
>the beginning of the year.


Thank you for your time!


>Has the issue repeated since then?


Yes, it's happening all the time, but meanwhile I wrote a script which monitors the problem and kills frozen processes when it occurs. I don't like it much, though, it's not a real solution for me :( I also noticed that the problem always affects the whole server, not only the frozen cgroup. Depending on the number of frozen processes, sometimes it has almost no impact on the rest of the server and sometimes the whole server lags badly.

I have another old problem which might also be related to this. I wasn't connecting it with this before, but now I'm not sure. Two of our servers, which are affected by this cgroup problem, are also randomly freezing completely (a few times per month). These are the symptoms:
 - servers are responding to ping
 - it is possible to connect via SSH, but the connection freezes after sending the password
 - it is possible to log in via console, but it freezes after typing the login
These symptoms are very similar to HDD problems or HDD overload (but there is definitely no overload). The only way to fix it is, apparently, hard rebooting the server (didn't find any other way). What do you think? Can this be related? Maybe HDDs are locked in a similar way to the cgroups; we already found out that cgroup freezing is also related to HDD activity. Maybe there is a small chance that the whole HDD subsystem ends up in a deadlock?


>You said you didn't apply anything other than the above-mentioned
>patch. Could you also apply the debugging part of the patches I have
>sent? In case you don't have it handy, it should be this one:


Just to be sure - am I supposed to apply these two patches?
http://watchdog.sk/lkml/patches/


azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 14:49                                                                                       ` azurIt
@ 2013-02-05 16:09                                                                                         ` Michal Hocko
  2013-02-05 16:46                                                                                           ` azurIt
                                                                                                             ` (2 more replies)
  2013-02-05 16:31                                                                                         ` Michal Hocko
  1 sibling, 3 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-05 16:09 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> Just to be sure - am I supposed to apply these two patches?
> http://watchdog.sk/lkml/patches/

5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
mentioned in a follow-up email. Here is the full patch:
---
>From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg OOM killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap, or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because the administrator can still
intervene and increase the limit on the group, which helps the writer
finish the allocation and release the lock.

This patch fixes the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
which then tells mem_cgroup_charge_common that OOM is not allowed for
the charge. Forbidding OOM from this path, besides fixing the bug, also
makes some sense as we really do not want to cause an OOM because of
page cache usage.
As a possibly visible result, add_to_page_cache_lru might fail more
often with ENOMEM, but this is to be expected if the limit is set and it
is preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg-specific flag because no
user-accounted allocation uses this flag except for THP, which has memcg
OOM disabled already.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   13 +++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |   10 ++++++----
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1986c65..a68aa08 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = true;
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -2771,6 +2771,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2783,7 +2784,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2819,6 +2820,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2841,13 +2843,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
1.7.10.4


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 14:49                                                                                       ` azurIt
  2013-02-05 16:09                                                                                         ` Michal Hocko
@ 2013-02-05 16:31                                                                                         ` Michal Hocko
  1 sibling, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-05 16:31 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> I have another old problem which might also be related to this. I
> wasn't connecting it with this before, but now I'm not sure. Two of our
> servers, which are affected by this cgroup problem, are also randomly
> freezing completely (a few times per month). These are the symptoms:
>  - servers are responding to ping
>  - it is possible to connect via SSH, but the connection freezes after
>  sending the password
>  - it is possible to log in via console, but it freezes after typing
>  the login
> These symptoms are very similar to HDD problems or HDD overload (but
> there is definitely no overload). The only way to fix it is,
> apparently, hard rebooting the server (didn't find any other way).
> What do you think? Can this be related?

This is hard to tell without further information.

> Maybe HDDs are locked in a similar way to the cgroups; we already
> found out that cgroup freezing is also related to HDD activity. Maybe
> there is a small chance that the whole HDD subsystem ends up in a
> deadlock?

The "HDD subsystem", whatever that means, cannot be blocked by memcg
being stuck. Certain access to some files might be an issue because
those could have locks held, but I do not see other relations.

I would start by checking the HW and trying to focus on reducing the
elements that could contribute - i.e. try to nail it down to the minimum
set which reproduces the issue. I cannot help you much with that, I am
afraid.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 16:09                                                                                         ` Michal Hocko
@ 2013-02-05 16:46                                                                                           ` azurIt
  2013-02-05 16:48                                                                                           ` Greg Thelen
  2013-02-06  1:17                                                                                           ` azurIt
  2 siblings, 0 replies; 172+ messages in thread
From: azurIt @ 2013-02-05 16:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
>mentioned in a follow-up email.

Oh, it wasn't complete? i used it in my last test.. sorry, i'm a little confused by all those patches. will try it tonight and report back.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 16:09                                                                                         ` Michal Hocko
  2013-02-05 16:46                                                                                           ` azurIt
@ 2013-02-05 16:48                                                                                           ` Greg Thelen
  2013-02-05 17:46                                                                                             ` Michal Hocko
  2013-02-06  1:17                                                                                           ` azurIt
  2 siblings, 1 reply; 172+ messages in thread
From: Greg Thelen @ 2013-02-05 16:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Tue, Feb 05 2013, Michal Hocko wrote:

> On Tue 05-02-13 15:49:47, azurIt wrote:
> [...]
>> Just to be sure - am i supposed to apply these two patches?
>> http://watchdog.sk/lkml/patches/
>
> 5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
> mentioned in a follow up email. Here is the full patch:
> ---
> From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 26 Nov 2012 11:47:57 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents another task from
> terminating because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncate it).
>
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff

It looks like grab_cache_page_write_begin() passes __GFP_FS into
__page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
think that this deadlock is also possible in the page allocator even
before getting to add_to_page_cache_lru.  no?

Can callers holding fs resources (e.g. i_mutex) pass __GFP_FS into the
page allocator?  If __GFP_FS was avoided, then I think memcg user page
charging would need a !__GFP_FS check to avoid invoking oom killer, but
at least then we'd avoid both deadlocks and cover both page allocation
and memcg page charging in similar fashion.

Example from memcg_charge_kmem:
	may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 16:48                                                                                           ` Greg Thelen
@ 2013-02-05 17:46                                                                                             ` Michal Hocko
  2013-02-05 18:09                                                                                               ` Greg Thelen
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-05 17:46 UTC (permalink / raw)
  To: Greg Thelen
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 15:49:47, azurIt wrote:
> > [...]
> >> Just to be sure - am i supposed to apply these two patches?
> >> http://watchdog.sk/lkml/patches/
> >
> > 5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
> > mentioned in a follow up email. Here is the full patch:
> > ---
> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >
> > memcg oom killer might deadlock if the process which falls down to
> > mem_cgroup_handle_oom holds a lock which prevents other task to
> > terminate because it is blocked on the very same lock.
> > This can happen when a write system call needs to allocate a page but
> > the allocation hits the memcg hard limit and there is nothing to reclaim
> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> > have been reclaimed already) and the process selected by memcg OOM
> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >
> > Process A
> > [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> > [<ffffffff81121c90>] do_last+0x250/0xa30
> > [<ffffffff81122547>] path_openat+0xd7/0x440
> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> > [<ffffffff8110f950>] sys_open+0x20/0x30
> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > Process B
> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> > [<ffffffff8111156a>] do_sync_write+0xea/0x130
> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> > [<ffffffff81112381>] sys_write+0x51/0x90
> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> > [<ffffffffffffffff>] 0xffffffffffffffff
> 
> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> think that this deadlock is also possible in the page allocator even
> before getting to add_to_page_cache_lru.  no?

I am not that familiar with VFS but i_mutex is a high-level lock AFAIR
and it shouldn't be taken in the pageout path, so __page_cache_alloc
should be safe.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 17:46                                                                                             ` Michal Hocko
@ 2013-02-05 18:09                                                                                               ` Greg Thelen
  2013-02-05 18:59                                                                                                 ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Greg Thelen @ 2013-02-05 18:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Tue, Feb 05 2013, Michal Hocko wrote:

> On Tue 05-02-13 08:48:23, Greg Thelen wrote:
>> On Tue, Feb 05 2013, Michal Hocko wrote:
>> 
>> > On Tue 05-02-13 15:49:47, azurIt wrote:
>> > [...]
>> >> Just to be sure - am i supposed to apply these two patches?
>> >> http://watchdog.sk/lkml/patches/
>> >
>> > 5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
>> > mentioned in a follow up email. Here is the full patch:
>> > ---
>> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
>> > From: Michal Hocko <mhocko@suse.cz>
>> > Date: Mon, 26 Nov 2012 11:47:57 +0100
>> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>> >
>> > memcg oom killer might deadlock if the process which falls down to
>> > mem_cgroup_handle_oom holds a lock which prevents another task from
>> > terminating because it is blocked on the very same lock.
>> > This can happen when a write system call needs to allocate a page but
>> > the allocation hits the memcg hard limit and there is nothing to reclaim
>> > (e.g. there is no swap or swap limit is hit as well and all cache pages
>> > have been reclaimed already) and the process selected by memcg OOM
>> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
>> >
>> > Process A
>> > [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
>> > [<ffffffff81121c90>] do_last+0x250/0xa30
>> > [<ffffffff81122547>] path_openat+0xd7/0x440
>> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
>> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
>> > [<ffffffff8110f950>] sys_open+0x20/0x30
>> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
>> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> > Process B
>> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
>> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
>> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
>> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
>> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
>> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
>> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
>> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
>> > [<ffffffff8111156a>] do_sync_write+0xea/0x130
>> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
>> > [<ffffffff81112381>] sys_write+0x51/0x90
>> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
>> > [<ffffffffffffffff>] 0xffffffffffffffff
>> 
>> It looks like grab_cache_page_write_begin() passes __GFP_FS into
>> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
>> think that this deadlock is also possible in the page allocator even
>> before getting to add_to_page_cache_lru.  no?
>
> I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> and it shouldn't be called from the pageout path so __page_cache_alloc
> should be safe.

I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
My concern is that __page_cache_alloc() will invoke the oom killer and
select a victim which wants i_mutex.  This victim will deadlock because
the oom killer caller already holds i_mutex.  The wild accusation I am
making is that anyone who invokes the oom killer and waits on the victim
to die is essentially grabbing all of the locks that any of the oom
killer victims may grab (e.g. i_mutex).  To avoid deadlock, the oom
killer can only be called while holding no locks that the oom victim
demands.  I think some locks are grabbed in a way that allows the lock
request to fail if the task has a fatal signal pending, so they are
safe.  But any lock acquisitions that cannot fail (e.g. mutex_lock)
will deadlock with the oom killing process.  So the oom killing process
cannot hold any such locks which the victim will attempt to grab.
Hopefully I'm missing something.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 18:09                                                                                               ` Greg Thelen
@ 2013-02-05 18:59                                                                                                 ` Michal Hocko
  2013-02-08  4:27                                                                                                   ` Greg Thelen
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-05 18:59 UTC (permalink / raw)
  To: Greg Thelen
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Tue 05-02-13 10:09:57, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> 
> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
> >> > [...]
> >> >> Just to be sure - am i supposed to apply these two patches?
> >> >> http://watchdog.sk/lkml/patches/
> >> >
> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
> >> > mentioned in a follow up email. Here is the full patch:
> >> > ---
> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> >> > From: Michal Hocko <mhocko@suse.cz>
> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >> >
> >> > memcg oom killer might deadlock if the process which falls down to
> >> > mem_cgroup_handle_oom holds a lock which prevents another task from
> >> > terminating because it is blocked on the very same lock.
> >> > This can happen when a write system call needs to allocate a page but
> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> >> > have been reclaimed already) and the process selected by memcg OOM
> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >> >
> >> > Process A
> >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> >> > [<ffffffff81121c90>] do_last+0x250/0xa30
> >> > [<ffffffff81122547>] path_openat+0xd7/0x440
> >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> >> > [<ffffffff8110f950>] sys_open+0x20/0x30
> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> > Process B
> >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130
> >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> >> > [<ffffffff81112381>] sys_write+0x51/0x90
> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> 
> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> >> think that this deadlock is also possible in the page allocator even
> >> before getting to add_to_page_cache_lru.  no?
> >
> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> > and it shouldn't be called from the pageout path so __page_cache_alloc
> > should be safe.
> 
> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
> My concern is that __page_cache_alloc() will invoke the oom killer and
> select a victim which wants i_mutex.  This victim will deadlock because
> the oom killer caller already holds i_mutex.  

That would be true for the memcg oom because that one is blocking, but
the global oom just puts the allocator to sleep for a while and then
the allocator should back off eventually (unless this is a NOFAIL
allocation). I would need to look closer at whether this is really the case
- I haven't seen that allocator code path for a while...

> The wild accusation I am making is that anyone who invokes the oom
> killer and waits on the victim to die is essentially grabbing all of
> the locks that any of the oom killer victims may grab (e.g. i_mutex).

True.

> To avoid deadlock, the oom killer can only be called while holding
> no locks that the oom victim demands.  I think some locks are grabbed
> in a way that allows the lock request to fail if the task has a fatal
> signal pending, so they are safe.  But any lock acquisitions that
> cannot fail (e.g. mutex_lock) will deadlock with the oom killing
> process.  So the oom killing process cannot hold any such locks which
> the victim will attempt to grab.  Hopefully I'm missing something.

Agreed.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 16:09                                                                                         ` Michal Hocko
  2013-02-05 16:46                                                                                           ` azurIt
  2013-02-05 16:48                                                                                           ` Greg Thelen
@ 2013-02-06  1:17                                                                                           ` azurIt
  2013-02-06 14:01                                                                                             ` Michal Hocko
  2 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-02-06  1:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
>mentioned in a follow up email. Here is the full patch:


Here is the log where the OOM killer, again, killed the MySQL server [search for "(mysqld)"]:
http://www.watchdog.sk/lkml/oom_mysqld6

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-06  1:17                                                                                           ` azurIt
@ 2013-02-06 14:01                                                                                             ` Michal Hocko
  2013-02-06 14:22                                                                                               ` Michal Hocko
  2013-02-07 11:01                                                                                               ` [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Kamezawa Hiroyuki
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-06 14:01 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Wed 06-02-13 02:17:21, azurIt wrote:
> >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> >mentioned in a follow up email. Here is the full patch:
> 
> 
> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
> http://www.watchdog.sk/lkml/oom_mysqld6

[...]
WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
Hardware name: S5000VSA
gfp_mask:4304 nr_pages:1 oom:0 ret:2
Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
Call Trace:
 [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
 [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
 [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
 [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
 [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
 [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
 [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
 [<ffffffff810eab18>] __do_fault+0x78/0x5a0
 [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
 [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
 [<ffffffff810f2508>] ? vma_link+0x88/0xe0
 [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
 [<ffffffff8102709d>] do_page_fault+0x13d/0x460
 [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
 [<ffffffff815b61ff>] page_fault+0x1f/0x30
---[ end trace 8817670349022007 ]---
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
Call Trace:
 [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
 [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
 [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
 [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
 [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
 [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
 [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
 [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
 [<ffffffff815b61ff>] page_fault+0x1f/0x30

The first trace comes from the debugging WARN and it clearly points to
a file fault path. __do_fault pre-charges a page in case we need to
do CoW (copy-on-write) for the returned page. This one falls back to
memcg OOM and never returns ENOMEM as I have mentioned earlier. 
However, the fs fault handler (filemap_fault here) can fall back to
page_cache_read if the readahead (do_sync_mmap_readahead) fails
to get the page into the page cache. And we can see this happening in
the first trace. page_cache_read then calls add_to_page_cache_lru
and eventually gets to add_to_page_cache_locked which calls
mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
happen. This ENOMEM gets to the fault handler and kaboom.

So the fix is really much more complex than I thought. Although
add_to_page_cache_locked sounded like a good place, it turned out not
to be one in fact.

We need something more clever, apparently. One way would be to stop misusing
__GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
bits for those flags in gfp_t so there should be some room there. 
Or we could do this per task flag, same we do for NO_IO in the current
-mm tree.
The latter one seems easier wrt. the gfp_mask passing horror - e.g.
__generic_file_aio_write doesn't pass flags and it can be called from
unlocked contexts as well.

I have to think about it some more.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-06 14:01                                                                                             ` Michal Hocko
@ 2013-02-06 14:22                                                                                               ` Michal Hocko
  2013-02-06 16:00                                                                                                 ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Michal Hocko
  2013-02-07 11:01                                                                                               ` [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Kamezawa Hiroyuki
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-06 14:22 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Wed 06-02-13 15:01:19, Michal Hocko wrote:
> On Wed 06-02-13 02:17:21, azurIt wrote:
> > >5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
> > >mentioned in a follow up email. Here is the full patch:
> > 
> > 
> > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
> > http://www.watchdog.sk/lkml/oom_mysqld6
> 
> [...]
> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> Hardware name: S5000VSA
> gfp_mask:4304 nr_pages:1 oom:0 ret:2
> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> Call Trace:
>  [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
>  [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
>  [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
>  [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
>  [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
>  [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
>  [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
>  [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
>  [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
>  [<ffffffff810eab18>] __do_fault+0x78/0x5a0
>  [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
>  [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
>  [<ffffffff810f2508>] ? vma_link+0x88/0xe0
>  [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
>  [<ffffffff8102709d>] do_page_fault+0x13d/0x460
>  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>  [<ffffffff815b61ff>] page_fault+0x1f/0x30
> ---[ end trace 8817670349022007 ]---
> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> apache2 cpuset=uid mems_allowed=0
> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> Call Trace:
>  [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
>  [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
>  [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
>  [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
>  [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
>  [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
>  [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
>  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>  [<ffffffff815b61ff>] page_fault+0x1f/0x30
> 
> The first trace comes from the debugging WARN and it clearly points to
> a file fault path. __do_fault pre-charges a page in case we need to
> do CoW (copy-on-write) for the returned page. This one falls back to
> memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> However, the fs fault handler (filemap_fault here) can fallback to
> page_cache_read if the readahead (do_sync_mmap_readahead) fails
> to get page to the page cache. And we can see this happening in
> the first trace. page_cache_read then calls add_to_page_cache_lru
> and eventually gets to add_to_page_cache_locked which calls
> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> happen. This ENOMEM gets to the fault handler and kaboom.
> 
> So the fix is really much more complex than I thought. Although
> add_to_page_cache_locked sounded like a good place it turned out to be
> not in fact.
> 
> We need something more clever, apparently. One way would be to stop misusing
> __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
> bits for those flags in gfp_t so there should be some room there. 
> Or we could do this per task flag, same we do for NO_IO in the current
> -mm tree.
> The latter one seems easier wrt. the gfp_mask passing horror - e.g.
> __generic_file_aio_write doesn't pass flags and it can be called from
> unlocked contexts as well.

Ouch, the PF_ flags space seems to be drained already because
task_struct::flags is just an unsigned int, so there is just one bit left. I
am not sure this is the best use for it. This will be a real pain!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-06 14:22                                                                                               ` Michal Hocko
@ 2013-02-06 16:00                                                                                                 ` Michal Hocko
  2013-02-08  5:03                                                                                                   ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-06 16:00 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Wed 06-02-13 15:22:19, Michal Hocko wrote:
> On Wed 06-02-13 15:01:19, Michal Hocko wrote:
> > On Wed 06-02-13 02:17:21, azurIt wrote:
> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
> > > >mentioned in a follow up email. Here is the full patch:
> > > 
> > > 
> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
> > > http://www.watchdog.sk/lkml/oom_mysqld6
> > 
> > [...]
> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> > Hardware name: S5000VSA
> > gfp_mask:4304 nr_pages:1 oom:0 ret:2
> > Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> > Call Trace:
> >  [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
> >  [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
> >  [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
> >  [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
> >  [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
> >  [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
> >  [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
> >  [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
> >  [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
> >  [<ffffffff810eab18>] __do_fault+0x78/0x5a0
> >  [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
> >  [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
> >  [<ffffffff810f2508>] ? vma_link+0x88/0xe0
> >  [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
> >  [<ffffffff8102709d>] do_page_fault+0x13d/0x460
> >  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
> >  [<ffffffff815b61ff>] page_fault+0x1f/0x30
> > ---[ end trace 8817670349022007 ]---
> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> > apache2 cpuset=uid mems_allowed=0
> > Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> > Call Trace:
> >  [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
> >  [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
> >  [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
> >  [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
> >  [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
> >  [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
> >  [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
> >  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
> >  [<ffffffff815b61ff>] page_fault+0x1f/0x30
> > 
> > The first trace comes from the debugging WARN and it clearly points to
> > a file fault path. __do_fault pre-charges a page in case we need to
> > do CoW (copy-on-write) for the returned page. This one falls back to
> > memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> > However, the fs fault handler (filemap_fault here) can fallback to
> > page_cache_read if the readahead (do_sync_mmap_readahead) fails
> > to get page to the page cache. And we can see this happening in
> > the first trace. page_cache_read then calls add_to_page_cache_lru
> > and eventually gets to add_to_page_cache_locked which calls
> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> > happen. This ENOMEM gets to the fault handler and kaboom.
> > 
> > So the fix is really much more complex than I thought. Although
> > add_to_page_cache_locked sounded like a good place, it turned out not
> > to be.
> > 
> > We need something more clever apparently. One way would be not misusing
> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and giving it a real flag. We have 32
> > bits for those flags in gfp_t so there should be some room there.
> > Or we could do this per task flag, same we do for NO_IO in the current
> > -mm tree.
> > The latter one seems easier wrt. the gfp_mask passing horror - e.g.
> > __generic_file_aio_write doesn't pass flags and it can be called from
> > unlocked contexts as well.
> 
> Ouch, PF_ flags space seems to be drained already because
> task_struct::flags is just unsigned int so there is just one bit left. I
> am not sure this is the best use for it. This will be a real pain!

OK, so this is something that should help you without any risk of false
OOMs. I do not believe that something like that would be accepted
upstream because it is really heavy. We will need to come up with
something more clever for upstream.
I have also added a warning which will trigger when the charge fails. If
you see too many of those messages then something bad is going on
and the lack of OOM is causing userspace to loop without making any
progress.

So there you go - your personal patch ;) You can drop all other patches.
Please note I have only compile-tested it, but it should be pretty
trivial to check that it is correct.
---
From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 6 Feb 2013 16:45:07 +0100
Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from dangerous context.
Memcg charging code has no way to find out whether it is called from a
locked context, so we have to help it via process flags. The recently
removed PF_OOM_ORIGIN flag will be reused as PF_NO_MEMCG_OOM, which
signals that the memcg OOM killer could lead to a deadlock.
Only locked callers of __generic_file_aio_write are currently marked. I
am pretty sure there are more places (I didn't check shmem, hugetlb uses
a fancy instantiation mutex during page fault, and filesystems might
take some locks during the write) but I've ignored those as this will
probably be just a user-specific patch without any way to get upstream
in the current form.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/staging/pohmelfs/inode.c |    2 ++
 include/linux/sched.h            |    1 +
 mm/filemap.c                     |    2 ++
 mm/memcontrol.c                  |   18 ++++++++++++++----
 4 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7a19555..523de82e 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
 	if (ret)
 		goto err_out_unlock;
 
+	current->flags |= PF_NO_MEMCG_OOM;
 	ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos);
+	current->flags &= ~PF_NO_MEMCG_OOM;
 	*ppos = kiocb.ki_pos;
 
 	mutex_unlock(&inode->i_mutex);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e86bb4..f275c8f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
+#define PF_NO_MEMCG_OOM	0x00080000	/* Memcg OOM could lead to a deadlock */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..58a316b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 
 	mutex_lock(&inode->i_mutex);
 	blk_start_plug(&plug);
+	current->flags |= PF_NO_MEMCG_OOM;
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
+	current->flags &= ~PF_NO_MEMCG_OOM;
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 || ret == -EIOCBQUEUED) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..128b615 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,14 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	if (printk_ratelimit())
+		printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p."
+				" If this message shows up very often for the"
+				" same task then there is a risk that the"
+				" process is not able to make any progress"
+				" because of the current limit. Try to enlarge"
+				" the hard limit.\n", __FUNCTION__,
+				current->comm, current->pid, memcg);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = true;
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-06 14:01                                                                                             ` Michal Hocko
  2013-02-06 14:22                                                                                               ` Michal Hocko
@ 2013-02-07 11:01                                                                                               ` Kamezawa Hiroyuki
  2013-02-07 12:31                                                                                                 ` Michal Hocko
  2013-02-08  1:40                                                                                                 ` Kamezawa Hiroyuki
  1 sibling, 2 replies; 172+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-07 11:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner

(2013/02/06 23:01), Michal Hocko wrote:
> On Wed 06-02-13 02:17:21, azurIt wrote:
>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
>>> mentioned in a follow up email. Here is the full patch:
>>
>>
>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
>> http://www.watchdog.sk/lkml/oom_mysqld6
>
> [...]
> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> Hardware name: S5000VSA
> gfp_mask:4304 nr_pages:1 oom:0 ret:2
> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> Call Trace:
>   [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
>   [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
>   [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
>   [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
>   [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
>   [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
>   [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
>   [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
>   [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
>   [<ffffffff810eab18>] __do_fault+0x78/0x5a0
>   [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
>   [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
>   [<ffffffff810f2508>] ? vma_link+0x88/0xe0
>   [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
>   [<ffffffff8102709d>] do_page_fault+0x13d/0x460
>   [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>   [<ffffffff815b61ff>] page_fault+0x1f/0x30
> ---[ end trace 8817670349022007 ]---
> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> apache2 cpuset=uid mems_allowed=0
> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> Call Trace:
>   [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
>   [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
>   [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
>   [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
>   [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
>   [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
>   [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
>   [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>   [<ffffffff815b61ff>] page_fault+0x1f/0x30
>
> The first trace comes from the debugging WARN and it clearly points to
> a file fault path. __do_fault pre-charges a page in case we need to
> do CoW (copy-on-write) for the returned page. This one falls back to
> memcg OOM and never returns ENOMEM as I have mentioned earlier.
> However, the fs fault handler (filemap_fault here) can fallback to
> page_cache_read if the readahead (do_sync_mmap_readahead) fails
> to get page to the page cache. And we can see this happening in
> the first trace. page_cache_read then calls add_to_page_cache_lru
> and eventually gets to add_to_page_cache_locked which calls
> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> happen. This ENOMEM gets to the fault handler and kaboom.
>

Hmm, do we need to increase the "limit" virtually at memcg OOM until
the oom-killed process dies? It may be doable by increasing stock->cache
on each cpu... I think the kernel can offer extra virtual charge up to
the oom-killed process's memory usage...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-07 11:01                                                                                               ` [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Kamezawa Hiroyuki
@ 2013-02-07 12:31                                                                                                 ` Michal Hocko
  2013-02-08  4:16                                                                                                   ` Kamezawa Hiroyuki
  2013-02-08  1:40                                                                                                 ` Kamezawa Hiroyuki
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-07 12:31 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner

On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote:
> (2013/02/06 23:01), Michal Hocko wrote:
> >On Wed 06-02-13 02:17:21, azurIt wrote:
> >>>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> >>>mentioned in a follow up email. Here is the full patch:
> >>
> >>
> >>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
> >>http://www.watchdog.sk/lkml/oom_mysqld6
> >
> >[...]
> >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> >Hardware name: S5000VSA
> >gfp_mask:4304 nr_pages:1 oom:0 ret:2
> >Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> >Call Trace:
> >  [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
> >  [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
> >  [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
> >  [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
> >  [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
> >  [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
> >  [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
> >  [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
> >  [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
> >  [<ffffffff810eab18>] __do_fault+0x78/0x5a0
> >  [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
> >  [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
> >  [<ffffffff810f2508>] ? vma_link+0x88/0xe0
> >  [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
> >  [<ffffffff8102709d>] do_page_fault+0x13d/0x460
> >  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
> >  [<ffffffff815b61ff>] page_fault+0x1f/0x30
> >---[ end trace 8817670349022007 ]---
> >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> >apache2 cpuset=uid mems_allowed=0
> >Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> >Call Trace:
> >  [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
> >  [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
> >  [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
> >  [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
> >  [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
> >  [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
> >  [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
> >  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
> >  [<ffffffff815b61ff>] page_fault+0x1f/0x30
> >
> >The first trace comes from the debugging WARN and it clearly points to
> >a file fault path. __do_fault pre-charges a page in case we need to
> >do CoW (copy-on-write) for the returned page. This one falls back to
> >memcg OOM and never returns ENOMEM as I have mentioned earlier.
> >However, the fs fault handler (filemap_fault here) can fallback to
> >page_cache_read if the readahead (do_sync_mmap_readahead) fails
> >to get page to the page cache. And we can see this happening in
> >the first trace. page_cache_read then calls add_to_page_cache_lru
> >and eventually gets to add_to_page_cache_locked which calls
> >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> >happen. This ENOMEM gets to the fault handler and kaboom.
> >
> 
> Hmm. do we need to increase the "limit" virtually at memcg oom until
> the oom-killed process dies ? It may be doable by increasing stock->cache
> of each cpu....I think kernel can offer extra virtual charge up to
> oom-killed process's memory usage.....

If we can guarantee that the overflow charges do not exceed the memory
usage of the killed process then this would work. The question is, how
do we find out how much we can overflow. move_charge_at_immigrate will
play some role, as well as the amount of shared memory. I am afraid this
would get too complex. Nevertheless the idea is nice.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-07 11:01                                                                                               ` [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Kamezawa Hiroyuki
  2013-02-07 12:31                                                                                                 ` Michal Hocko
@ 2013-02-08  1:40                                                                                                 ` Kamezawa Hiroyuki
  2013-02-08 16:01                                                                                                   ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-08  1:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner

(2013/02/07 20:01), Kamezawa Hiroyuki wrote:
> (2013/02/06 23:01), Michal Hocko wrote:
>> On Wed 06-02-13 02:17:21, azurIt wrote:
>>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
>>>> mentioned in a follow up email. Here is the full patch:
>>>
>>>
>>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
>>> http://www.watchdog.sk/lkml/oom_mysqld6
>>
>> [...]
>> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
>> Hardware name: S5000VSA
>> gfp_mask:4304 nr_pages:1 oom:0 ret:2
>> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
>> Call Trace:
>>   [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
>>   [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
>>   [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
>>   [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
>>   [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
>>   [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
>>   [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
>>   [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
>>   [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
>>   [<ffffffff810eab18>] __do_fault+0x78/0x5a0
>>   [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
>>   [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
>>   [<ffffffff810f2508>] ? vma_link+0x88/0xe0
>>   [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
>>   [<ffffffff8102709d>] do_page_fault+0x13d/0x460
>>   [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>>   [<ffffffff815b61ff>] page_fault+0x1f/0x30
>> ---[ end trace 8817670349022007 ]---
>> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>> apache2 cpuset=uid mems_allowed=0
>> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
>> Call Trace:
>>   [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
>>   [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
>>   [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
>>   [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
>>   [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
>>   [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
>>   [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
>>   [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>>   [<ffffffff815b61ff>] page_fault+0x1f/0x30
>>
>> The first trace comes from the debugging WARN and it clearly points to
>> a file fault path. __do_fault pre-charges a page in case we need to
>> do CoW (copy-on-write) for the returned page. This one falls back to
>> memcg OOM and never returns ENOMEM as I have mentioned earlier.
>> However, the fs fault handler (filemap_fault here) can fallback to
>> page_cache_read if the readahead (do_sync_mmap_readahead) fails
>> to get page to the page cache. And we can see this happening in
>> the first trace. page_cache_read then calls add_to_page_cache_lru
>> and eventually gets to add_to_page_cache_locked which calls
>> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
>> happen. This ENOMEM gets to the fault handler and kaboom.
>>
>
> Hmm. do we need to increase the "limit" virtually at memcg oom until
> the oom-killed process dies ?

Here is my naive idea...
==
 From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Fri, 8 Feb 2013 10:43:52 +0900
Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation.

When an OOM happens, a task is killed and resources will be freed.

A problem here is that an oom-killed task may wait for some other
resource whose release itself requires memory. A thread waiting for
free memory may hold a mutex that the oom-killed process is waiting
for.

To avoid this, relaxing the charged memory by granting a virtual
resource can help. The system gets it back at uncharge().
This is a sample naive implementation.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
  mm/memcontrol.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
  1 file changed, 73 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25ac5f4..4dea49a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -301,6 +301,9 @@ struct mem_cgroup {
  	/* set when res.limit == memsw.limit */
  	bool		memsw_is_minimum;
  
+	/* extra resource at emergency situation */
+	unsigned long	loan;
+	spinlock_t	loan_lock;
  	/* protect arrays of thresholds */
  	struct mutex thresholds_lock;
  
@@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
  	mem_cgroup_iter_break(root_memcg, victim);
  	return total;
  }
+/*
+ * When a memcg is in OOM situation, this lack of resource may cause deadlock
+ * because of complicated lock dependency(i_mutex...). To avoid that, we
+ * need extra resource or avoid charging.
+ *
+ * A memcg can request resource in an emergency state. We call it as loan.
+ * A memcg will return a loan when it does uncharge resource. We disallow
+ * double-loan and moving task to other groups until the loan is fully
+ * returned.
+ *
+ * Note: the problem here is that we cannot know what amount of resource
+ * is necessary to exit an emergency state.
+ */
+#define LOAN_MAX		(2 * 1024 * 1024)
+
+static void mem_cgroup_make_loan(struct mem_cgroup *memcg)
+{
+	u64 usage;
+	unsigned long amount;
+
+	amount = LOAN_MAX;
+
+	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	if (amount > usage / 2)
+		amount = usage / 2;
+	spin_lock(&memcg->loan_lock);
+	if (memcg->loan) {
+		spin_unlock(&memcg->loan_lock);
+		return;
+	}
+	memcg->loan = amount;
+	res_counter_uncharge(&memcg->res, amount);
+	if (do_swap_account)
+		res_counter_uncharge(&memcg->memsw, amount);
+	spin_unlock(&memcg->loan_lock);
+}
+
+/* return amount of free resource which can be uncharged */
+static unsigned long
+mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val)
+{
+	unsigned long tmp;
+	/* we don't care small race here */
+	if (unlikely(!memcg->loan))
+		return val;
+	spin_lock(&memcg->loan_lock);
+	if (memcg->loan) {
+		tmp = min(memcg->loan, val);
+		memcg->loan -= tmp;
+		val -= tmp;
+	}
+	spin_unlock(&memcg->loan_lock);
+	return val;
+}
+
  
  /*
   * Check OOM-Killer is already running under our hierarchy.
@@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
  	if (need_to_kill) {
  		finish_wait(&memcg_oom_waitq, &owait.wait);
  		mem_cgroup_out_of_memory(memcg, mask, order);
+		mem_cgroup_make_loan(memcg);
  	} else {
  		schedule();
  		finish_wait(&memcg_oom_waitq, &owait.wait);
@@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
  	if (!mem_cgroup_is_root(memcg)) {
  		unsigned long bytes = nr_pages * PAGE_SIZE;
  
+		bytes = mem_cgroup_may_return_loan(memcg, bytes);
+
  		res_counter_uncharge(&memcg->res, bytes);
  		if (do_swap_account)
  			res_counter_uncharge(&memcg->memsw, bytes);
@@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
  {
  	struct memcg_batch_info *batch = NULL;
  	bool uncharge_memsw = true;
+	unsigned long val;
  
  	/* If swapout, usage of swap doesn't decrease */
  	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
@@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
  		batch->memsw_nr_pages++;
  	return;
  direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	val = nr_pages * PAGE_SIZE;
+	val = mem_cgroup_may_return_loan(memcg, val);
+	res_counter_uncharge(&memcg->res, val);
  	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+		res_counter_uncharge(&memcg->memsw, val);
  	if (unlikely(batch->memcg != memcg))
  		memcg_oom_recover(memcg);
  }
@@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void)
  void mem_cgroup_uncharge_end(void)
  {
  	struct memcg_batch_info *batch = &current->memcg_batch;
+	unsigned long val;
  
  	if (!batch->do_batch)
  		return;
@@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void)
  
  	if (!batch->memcg)
  		return;
+	val = batch->nr_pages * PAGE_SIZE;
+	val = mem_cgroup_may_return_loan(batch->memcg, val);
  	/*
  	 * This "batch->memcg" is valid without any css_get/put etc...
  	 * bacause we hide charges behind us.
  	 */
  	if (batch->nr_pages)
-		res_counter_uncharge(&batch->memcg->res,
-				     batch->nr_pages * PAGE_SIZE);
+		res_counter_uncharge(&batch->memcg->res, val);
  	if (batch->memsw_nr_pages)
-		res_counter_uncharge(&batch->memcg->memsw,
-				     batch->memsw_nr_pages * PAGE_SIZE);
+		res_counter_uncharge(&batch->memcg->memsw, val);
  	memcg_oom_recover(batch->memcg);
  	/* forget this pointer (for sanity check) */
  	batch->memcg = NULL;
@@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
  	memcg->move_charge_at_immigrate = 0;
  	mutex_init(&memcg->thresholds_lock);
  	spin_lock_init(&memcg->move_lock);
+	memcg->loan = 0;
+	spin_lock_init(&memcg->loan_lock);
  
  	return &memcg->css;
  
-- 
1.7.10.2








^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-07 12:31                                                                                                 ` Michal Hocko
@ 2013-02-08  4:16                                                                                                   ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 172+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-08  4:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner

(2013/02/07 21:31), Michal Hocko wrote:
> On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote:
>> (2013/02/06 23:01), Michal Hocko wrote:
>>> On Wed 06-02-13 02:17:21, azurIt wrote:
>>>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
>>>>> mentioned in a follow up email. Here is the full patch:
>>>>
>>>>
>>>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
>>>> http://www.watchdog.sk/lkml/oom_mysqld6
>>>
>>> [...]
>>> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
>>> Hardware name: S5000VSA
>>> gfp_mask:4304 nr_pages:1 oom:0 ret:2
>>> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
>>> Call Trace:
>>>   [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
>>>   [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
>>>   [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
>>>   [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
>>>   [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
>>>   [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
>>>   [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
>>>   [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
>>>   [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
>>>   [<ffffffff810eab18>] __do_fault+0x78/0x5a0
>>>   [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
>>>   [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
>>>   [<ffffffff810f2508>] ? vma_link+0x88/0xe0
>>>   [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
>>>   [<ffffffff8102709d>] do_page_fault+0x13d/0x460
>>>   [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>>>   [<ffffffff815b61ff>] page_fault+0x1f/0x30
>>> ---[ end trace 8817670349022007 ]---
>>> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>>> apache2 cpuset=uid mems_allowed=0
>>> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
>>> Call Trace:
>>>   [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
>>>   [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
>>>   [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
>>>   [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
>>>   [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
>>>   [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
>>>   [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
>>>   [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>>>   [<ffffffff815b61ff>] page_fault+0x1f/0x30
>>>
>>> The first trace comes from the debugging WARN and it clearly points to
>>> a file fault path. __do_fault pre-charges a page in case we need to
>>> do CoW (copy-on-write) for the returned page. This one falls back to
>>> memcg OOM and never returns ENOMEM as I have mentioned earlier.
>>> However, the fs fault handler (filemap_fault here) can fallback to
>>> page_cache_read if the readahead (do_sync_mmap_readahead) fails
>>> to get page to the page cache. And we can see this happening in
>>> the first trace. page_cache_read then calls add_to_page_cache_lru
>>> and eventually gets to add_to_page_cache_locked which calls
>>> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
>>> happen. This ENOMEM gets to the fault handler and kaboom.
>>>
>>
>> Hmm. do we need to increase the "limit" virtually at memcg oom until
>> the oom-killed process dies ? It may be doable by increasing stock->cache
>> of each cpu....I think kernel can offer extra virtual charge up to
>> oom-killed process's memory usage.....
>
> If we can guarantee that the overflow charges do not exceed the memory
> usage of the killed process then this would work. The question is, how
> do we find out how much we can overflow. immigrate_on_move will play
> some role as well as the amount of the shared memory. I am afraid this
> would get too complex. Nevertheless the idea is nice.
>
Yes, that's the problem. If we don't do it the correct way, a resource
usage underflow can happen. I guess we can count it per task_struct when
charging page-faulted anon pages.

_Or_, as another option, we could for example charge 1MB per thread
regardless of its memory usage and use it as a safety margin at OOM-killing.
Implementation would be easy but the explanation may be difficult..

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-05 18:59                                                                                                 ` Michal Hocko
@ 2013-02-08  4:27                                                                                                   ` Greg Thelen
  2013-02-08 16:29                                                                                                     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Greg Thelen @ 2013-02-08  4:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Tue, Feb 05 2013, Michal Hocko wrote:

> On Tue 05-02-13 10:09:57, Greg Thelen wrote:
>> On Tue, Feb 05 2013, Michal Hocko wrote:
>> 
>> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
>> >> On Tue, Feb 05 2013, Michal Hocko wrote:
>> >> 
>> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
>> >> > [...]
>> >> >> Just to be sure - am i supposed to apply this two patches?
>> >> >> http://watchdog.sk/lkml/patches/
>> >> >
>> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
>> >> > mentioned in a follow up email. Here is the full patch:
>> >> > ---
>> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
>> >> > From: Michal Hocko <mhocko@suse.cz>
>> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
>> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>> >> >
>> >> > memcg oom killer might deadlock if the process which falls down to
>> >> > mem_cgroup_handle_oom holds a lock which prevents another task from
>> >> > terminating because that task is blocked on the very same lock.
>> >> > This can happen when a write system call needs to allocate a page but
>> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
>> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
>> >> > have been reclaimed already) and the process selected by memcg OOM
>> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
>> >> >
>> >> > Process A
>> >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
>> >> > [<ffffffff81121c90>] do_last+0x250/0xa30
>> >> > [<ffffffff81122547>] path_openat+0xd7/0x440
>> >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
>> >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
>> >> > [<ffffffff8110f950>] sys_open+0x20/0x30
>> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >> >
>> >> > Process B
>> >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>> >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
>> >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
>> >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
>> >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
>> >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
>> >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
>> >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
>> >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
>> >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130
>> >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
>> >> > [<ffffffff81112381>] sys_write+0x51/0x90
>> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >> 
>> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
>> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
>> >> think that this deadlock is also possible in the page allocator even
>> >> before getting to add_to_page_cache_lru.  no?
>> >
>> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
>> > and it shouldn't be called from the pageout path so __page_cache_alloc
>> > should be safe.
>> 
>> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
>> My concern is that __page_cache_alloc() will invoke the oom killer and
>> select a victim which wants i_mutex.  This victim will deadlock because
>> the oom killer caller already holds i_mutex.  
>
> That would be true for the memcg oom because that one is blocking but
> the global oom just puts the allocator to sleep for a while and then
> the allocator should back off eventually (unless this is a NOFAIL
> allocation). I would need to look closer at whether this is really the
> case - I haven't seen that allocator code path for a while...

I think the page allocator can loop forever waiting for an oom victim to
terminate even without NOFAIL.  Especially if the oom victim wants a
resource exclusively held by the allocating thread (e.g. i_mutex).  It
looks like the same deadlock you describe is also possible (though more
rare) without memcg.

If the looping thread is an eligible oom victim (i.e. not oom disabled,
not a kernel thread, etc.) then the page allocator can return NULL so
long as NOFAIL is not used.  So any allocator which is able to call the
oom killer and is not oom disabled (kernel thread, etc) is already
exposed to the possibility of page allocator failure.  So if the page
allocator could detect the deadlock, then it could safely return NULL.
Maybe after looping N times without forward progress the page allocator
should consider failing unless NOFAIL is given.

Switching back to the memcg oom situation, can we similarly return NULL
if the memcg oom kill has been tried a reasonable number of times?  Simply
failing the memcg charge with ENOMEM seems easier to support than
exceeding limit (Kame's loan patch).

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-06 16:00                                                                                                 ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Michal Hocko
@ 2013-02-08  5:03                                                                                                   ` azurIt
  2013-02-08  9:44                                                                                                     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-02-08  5:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Michal, thank you very much but it just didn't work and broke everything :(

This happened:
The problem started to occur really often immediately after booting the new kernel, every few minutes for one of my users. But everything else seemed to work fine so i gave it a try for a day (which was a mistake). I grabbed some data for you and went to sleep:
http://watchdog.sk/lkml/memcg-bug-4.tar.gz

A few hours later i was woken from my sweet sweet dreams by alert SMSes - Apache wasn't working and our system failed to restart it. When i observed the situation, two apache processes (of that same user as above) were still running and it wasn't possible to kill them in any way. I grabbed some data for you:
http://watchdog.sk/lkml/memcg-bug-5.tar.gz

Then I logged to the console and this was waiting for me:
http://watchdog.sk/lkml/error.jpg

Finally i rebooted into a different kernel, wrote this e-mail and went to my lovely bed ;)



______________________________________________________________
> From: "Michal Hocko" <mhocko@suse.cz>
> To: azurIt <azurit@pobox.sk>
> Date: 06.02.2013 17:00
> Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org>
>On Wed 06-02-13 15:22:19, Michal Hocko wrote:
>> On Wed 06-02-13 15:01:19, Michal Hocko wrote:
>> > On Wed 06-02-13 02:17:21, azurIt wrote:
>> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the followup I
>> > > >mentioned in a follow up email. Here is the full patch:
>> > > 
>> > > 
>> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
>> > > http://www.watchdog.sk/lkml/oom_mysqld6
>> > 
>> > [...]
>> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
>> > Hardware name: S5000VSA
>> > gfp_mask:4304 nr_pages:1 oom:0 ret:2
>> > Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
>> > Call Trace:
>> >  [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0
>> >  [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50
>> >  [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0
>> >  [<ffffffff8110b6f9>] T.1149+0x2d9/0x610
>> >  [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50
>> >  [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0
>> >  [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140
>> >  [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50
>> >  [<ffffffff810cad32>] filemap_fault+0x252/0x4f0
>> >  [<ffffffff810eab18>] __do_fault+0x78/0x5a0
>> >  [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940
>> >  [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50
>> >  [<ffffffff810f2508>] ? vma_link+0x88/0xe0
>> >  [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260
>> >  [<ffffffff8102709d>] do_page_fault+0x13d/0x460
>> >  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>> >  [<ffffffff815b61ff>] page_fault+0x1f/0x30
>> > ---[ end trace 8817670349022007 ]---
>> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>> > apache2 cpuset=uid mems_allowed=0
>> > Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
>> > Call Trace:
>> >  [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0
>> >  [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70
>> >  [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0
>> >  [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200
>> >  [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110
>> >  [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0
>> >  [<ffffffff8102734e>] do_page_fault+0x3ee/0x460
>> >  [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430
>> >  [<ffffffff815b61ff>] page_fault+0x1f/0x30
>> > 
>> > The first trace comes from the debugging WARN and it clearly points to
>> > a file fault path. __do_fault pre-charges a page in case we need to
>> > do CoW (copy-on-write) for the returned page. This one falls back to
>> > memcg OOM and never returns ENOMEM as I have mentioned earlier. 
>> > However, the fs fault handler (filemap_fault here) can fall back to
>> > page_cache_read if the readahead (do_sync_mmap_readahead) fails
>> > to get the page into the page cache. And we can see this happening in
>> > the first trace. page_cache_read then calls add_to_page_cache_lru
>> > and eventually gets to add_to_page_cache_locked which calls
>> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
>> > happen. This ENOMEM gets to the fault handler and kaboom.
>> > 
>> > So the fix is really much more complex than I thought. Although
>> > add_to_page_cache_locked sounded like a good place, it turned out
>> > not to be.
>> > 
>> > We need something more clever apparently. One way would be not misusing
>> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
>> > bits for those flags in gfp_t so there should be some room there. 
>> > Or we could do this as a per-task flag, the same as we do for NO_IO in
>> > the current -mm tree.
>> > The latter one seems easier wrt. the gfp_mask passing horror - e.g.
>> > __generic_file_aio_write doesn't pass flags and it can be called from
>> > unlocked contexts as well.
>> 
>> Ouch, PF_ flags space seems to be drained already because
>> task_struct::flags is just unsigned int so there is just one bit left. I
>> am not sure this is the best use for it. This will be a real pain!
>
>OK, so this something that should help you without any risk of false
>OOMs. I do not believe that something like that would be accepted
>upstream because it is really heavy. We will need to come up with
>something more clever for upstream.
>I have also added a warning which will trigger when the charge fails. If
>you see too many of those messages then there is something bad going on
>and the lack of OOM causes userspace to loop without getting any
>progress.
>
>So there you go - your personal patch ;) You can drop all other patches.
>Please note I have just compile tested it. But it should be pretty
>trivial to check that it is correct.
>---
>From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001
>From: Michal Hocko <mhocko@suse.cz>
>Date: Wed, 6 Feb 2013 16:45:07 +0100
>Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
>memcg oom killer might deadlock if the process which falls down to
>mem_cgroup_handle_oom holds a lock which prevents another task from
>terminating because that task is blocked on the very same lock.
>This can happen when a write system call needs to allocate a page but
>the allocation hits the memcg hard limit and there is nothing to reclaim
>(e.g. there is no swap or swap limit is hit as well and all cache pages
>have been reclaimed already) and the process selected by memcg OOM
>killer is blocked on i_mutex on the same inode (e.g. truncate it).
>
>Process A
>[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
>[<ffffffff81121c90>] do_last+0x250/0xa30
>[<ffffffff81122547>] path_openat+0xd7/0x440
>[<ffffffff811229c9>] do_filp_open+0x49/0xa0
>[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
>[<ffffffff8110f950>] sys_open+0x20/0x30
>[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
>[<ffffffffffffffff>] 0xffffffffffffffff
>
>Process B
>[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
>[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
>[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
>[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
>[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
>[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
>[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
>[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
>[<ffffffff8111156a>] do_sync_write+0xea/0x130
>[<ffffffff81112183>] vfs_write+0xf3/0x1f0
>[<ffffffff81112381>] sys_write+0x51/0x90
>[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
>[<ffffffffffffffff>] 0xffffffffffffffff
>
>This is not a hard deadlock though because administrator can still
>intervene and increase the limit on the group which helps the writer to
>finish the allocation and release the lock.
>
>This patch heals the problem by forbidding OOM from dangerous context.
>Memcg charging code has no way to find out whether it is called from a
>locked context, so we have to help it via process flags. The PF_OOM_ORIGIN
>flag, removed recently, will be reused for PF_NO_MEMCG_OOM, which signals
>that the memcg OOM killer could lead to a deadlock.
>Only locked callers of __generic_file_aio_write are currently marked. I
>am pretty sure there are more places (I didn't check shmem, and hugetlb
>uses a fancy instantiation mutex during page faults, and filesystems might
>use some locks during the write) but I've ignored those as this will
>probably be just a user specific patch without any way to get upstream
>in the current form.
>
>Reported-by: azurIt <azurit@pobox.sk>
>Signed-off-by: Michal Hocko <mhocko@suse.cz>
>---
> drivers/staging/pohmelfs/inode.c |    2 ++
> include/linux/sched.h            |    1 +
> mm/filemap.c                     |    2 ++
> mm/memcontrol.c                  |   18 ++++++++++++++----
> 4 files changed, 19 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
>index 7a19555..523de82e 100644
>--- a/drivers/staging/pohmelfs/inode.c
>+++ b/drivers/staging/pohmelfs/inode.c
>@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
> 	if (ret)
> 		goto err_out_unlock;
> 
>+	current->flags |= PF_NO_MEMCG_OOM;
> 	ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos);
>+	current->flags &= ~PF_NO_MEMCG_OOM;
> 	*ppos = kiocb.ki_pos;
> 
> 	mutex_unlock(&inode->i_mutex);
>diff --git a/include/linux/sched.h b/include/linux/sched.h
>index 1e86bb4..f275c8f 100644
>--- a/include/linux/sched.h
>+++ b/include/linux/sched.h
>@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_FROZEN	0x00010000	/* frozen for system suspend */
> #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
> #define PF_KSWAPD	0x00040000	/* I am kswapd */
>+#define PF_NO_MEMCG_OOM	0x00080000	/* Memcg OOM could lead to a deadlock */
> #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
> #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
> #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
>diff --git a/mm/filemap.c b/mm/filemap.c
>index 556858c..58a316b 100644
>--- a/mm/filemap.c
>+++ b/mm/filemap.c
>@@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> 
> 	mutex_lock(&inode->i_mutex);
> 	blk_start_plug(&plug);
>+	current->flags |= PF_NO_MEMCG_OOM;
> 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>+	current->flags &= ~PF_NO_MEMCG_OOM;
> 	mutex_unlock(&inode->i_mutex);
> 
> 	if (ret > 0 || ret == -EIOCBQUEUED) {
>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>index c8425b1..128b615 100644
>--- a/mm/memcontrol.c
>+++ b/mm/memcontrol.c
>@@ -2397,6 +2397,14 @@ done:
> 	return 0;
> nomem:
> 	*ptr = NULL;
>+	if (printk_ratelimit())
>+		printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p."
>+				" If this message shows up very often for the"
>+				" same task then there is a risk that the"
>+				" process is not able to make any progress"
>+				" because of the current limit. Try to enlarge"
>+				" the hard limit.\n", __FUNCTION__,
>+				current->comm, current->pid, memcg);
> 	return -ENOMEM;
> bypass:
> 	*ptr = NULL;
>@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> 	struct mem_cgroup *memcg = NULL;
> 	unsigned int nr_pages = 1;
> 	struct page_cgroup *pc;
>-	bool oom = true;
>+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
> 	int ret;
> 
> 	if (PageTransHuge(page)) {
>@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> 				gfp_t gfp_mask)
> {
>+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
> 	struct mem_cgroup *memcg = NULL;
> 	int ret;
> 
>@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> 		mm = &init_mm;
> 
> 	if (page_is_file_cache(page)) {
>-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
>+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
> 		if (ret || !memcg)
> 			return ret;
> 
>@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> 				 struct page *page,
> 				 gfp_t mask, struct mem_cgroup **ptr)
> {
>+	bool oom = !(current->flags & PF_NO_MEMCG_OOM);
> 	struct mem_cgroup *memcg;
> 	int ret;
> 
>@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> 	if (!memcg)
> 		goto charge_cur_mm;
> 	*ptr = memcg;
>-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
>+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
> 	css_put(&memcg->css);
> 	return ret;
> charge_cur_mm:
> 	if (unlikely(!mm))
> 		mm = &init_mm;
>-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
>+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
> }
> 
> static void
>-- 
>1.7.10.4
>
>-- 
>Michal Hocko
>SUSE Labs
>

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08  5:03                                                                                                   ` azurIt
@ 2013-02-08  9:44                                                                                                     ` Michal Hocko
  2013-02-08 11:02                                                                                                       ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-08  9:44 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 08-02-13 06:03:04, azurIt wrote:
> Michal, thank you very much but it just didn't work and broke
> everything :(

I am sorry to hear that. The patch should help to solve the deadlock you
have seen earlier. It can in no way solve the side effects of failing writes
and it also cannot help much if the oom is permanent.

> This happened:
> Problem started to occur really often immediately after booting the
> new kernel, every few minutes for one of my users. But everything
> other seems to work fine so i gave it a try for a day (which was a
> mistake). I grabbed some data for you and go to sleep:
> http://watchdog.sk/lkml/memcg-bug-4.tar.gz

Do you have logs from that time period?

I have only glanced through the stacks and most of the threads are
waiting in the mem_cgroup_handle_oom (mostly from the page fault path
where we do not have other options than waiting) which suggests that
your memory limit is seriously underestimated. If you look at the number
of charging failures (the memory.failcnt per-group file) then you will get
9332083 failures on _average_ per group. This is a lot!
Not all those failures end with OOM, of course. But it clearly signals
that the workload needs much more memory than the limit allows.

> Few hours later i was woke up from my sweet sweet dreams by alerts
> smses - Apache wasn't working and our system failed to restart
> it. When i observed the situation, two apache processes (of that user
> as above) were still running and it wasn't possible to kill them by
> any way. I grabbed some data for you:
> http://watchdog.sk/lkml/memcg-bug-5.tar.gz

There are only 5 groups in this one and all of them have no memory
charged (so no OOM going on). All tasks are somewhere in the ptrace
code.

grep cache -r .
./1360297489/memory.stat:cache 0
./1360297489/memory.stat:total_cache 65642496
./1360297491/memory.stat:cache 0
./1360297491/memory.stat:total_cache 65642496
./1360297492/memory.stat:cache 0
./1360297492/memory.stat:total_cache 65642496
./1360297490/memory.stat:cache 0
./1360297490/memory.stat:total_cache 65642496
./1360297488/memory.stat:cache 0
./1360297488/memory.stat:total_cache 65642496

which suggests that this is a parent group and the memory is charged in
a child group. I guess that all those are under OOM as the numbers suggest
they have a limit of 62M.

> Then I logged to the console and this was waiting for me:
> http://watchdog.sk/lkml/error.jpg

This is just a warning and it should be harmless. There is just one WARN
in ptrace_check_attach:
	WARN_ON_ONCE(task_is_stopped(child))

This has been introduced by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=321fb561
and the commit description claims this shouldn't happen. I am not
familiar with this code but it sounds like a bug in the tracing code
which is not related to the discussed issue.

> Finally i rebooted into different kernel, wrote this e-mail and go to
> my lovely bed ;)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08  9:44                                                                                                     ` Michal Hocko
@ 2013-02-08 11:02                                                                                                       ` azurIt
  2013-02-08 12:38                                                                                                         ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-02-08 11:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>
>Do you have logs from that time period?
>
>I have only glanced through the stacks and most of the threads are
>waiting in the mem_cgroup_handle_oom (mostly from the page fault path
>where we do not have other options than waiting) which suggests that
>your memory limit is seriously underestimated. If you look at the number
>of charging failures (memory.failcnt per-group file) then you will get
>9332083 failures in _average_ per group. This is a lot!
>Not all those failures end with OOM, of course. But it clearly signals
>that the workload need much more memory than the limit allows.


What type of logs? I have all.

Memory usage graph:
http://www.watchdog.sk/lkml/memory2.png

The new kernel was booted at about 1:15. Data in memcg-bug-4.tar.gz were taken at about 2:35 and data in memcg-bug-5.tar.gz at about 5:25. There was always lots of free memory. The higher memory consumption between 3:39 and 5:33 was caused by a data backup which completed a few minutes before i restarted the server (this was just a coincidence).



>There are only 5 groups in this one and all of them have no memory
>charged (so no OOM going on). All tasks are somewhere in the ptrace
>code.


It's all from the same cgroup but from different times.



>grep cache -r .
>./1360297489/memory.stat:cache 0
>./1360297489/memory.stat:total_cache 65642496
>./1360297491/memory.stat:cache 0
>./1360297491/memory.stat:total_cache 65642496
>./1360297492/memory.stat:cache 0
>./1360297492/memory.stat:total_cache 65642496
>./1360297490/memory.stat:cache 0
>./1360297490/memory.stat:total_cache 65642496
>./1360297488/memory.stat:cache 0
>./1360297488/memory.stat:total_cache 65642496
>
>which suggests that this is a parent group and the memory is charged in
>a child group. I guess that all those are under OOM as the number seems
>like they have limit at 62M.


The cgroup has a limit of 330M (346030080 bytes). As i said, these two processes were stuck and it was impossible to kill them. They were, maybe, the processes which i was trying to 'strace' before - 'strace' froze as always when the cgroup has this problem and i killed it (i was just checking whether it was the original cgroup problem).

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 11:02                                                                                                       ` azurIt
@ 2013-02-08 12:38                                                                                                         ` Michal Hocko
  2013-02-08 13:56                                                                                                           ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-08 12:38 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 08-02-13 12:02:49, azurIt wrote:
> >
> >Do you have logs from that time period?
> >
> >I have only glanced through the stacks and most of the threads are
> >waiting in the mem_cgroup_handle_oom (mostly from the page fault path
> >where we do not have other options than waiting) which suggests that
> >your memory limit is seriously underestimated. If you look at the number
> >of charging failures (memory.failcnt per-group file) then you will get
> >9332083 failures in _average_ per group. This is a lot!
> >Not all those failures end with OOM, of course. But it clearly signals
> >that the workload need much more memory than the limit allows.
> 
> 
> What type of logs? I have all.

kernel log would be sufficient.

> Memory usage graph:
> http://www.watchdog.sk/lkml/memory2.png
> 
> New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence).
> 
> 
> 
> >There are only 5 groups in this one and all of them have no memory
> >charged (so no OOM going on). All tasks are somewhere in the ptrace
> >code.
> 
> 
> It's all from the same cgroup but from different time.
> 
> 
> 
> >grep cache -r .
> >./1360297489/memory.stat:cache 0
> >./1360297489/memory.stat:total_cache 65642496
> >./1360297491/memory.stat:cache 0
> >./1360297491/memory.stat:total_cache 65642496
> >./1360297492/memory.stat:cache 0
> >./1360297492/memory.stat:total_cache 65642496
> >./1360297490/memory.stat:cache 0
> >./1360297490/memory.stat:total_cache 65642496
> >./1360297488/memory.stat:cache 0
> >./1360297488/memory.stat:total_cache 65642496
> >
> >which suggests that this is a parent group and the memory is charged in
> >a child group. I guess that all those are under OOM as the number seems
> >like they have limit at 62M.
> 
> 
> The cgroup has limit 330M (346030080 bytes).

This limit is for top level groups, right? Those seem to be children which
have 62MB charged - is that the limit for those children?

> As i said, these two processes

Which are those two processes?

> were stucked and was impossible to kill them. They were,
> maybe, the processes which i was trying to 'strace' before - 'strace'
> was freezed as always when the cgroup has this problem and i killed it
> (i was just trying if it is the original cgroup problem).

I have no idea what is the strace role here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 12:38                                                                                                         ` Michal Hocko
@ 2013-02-08 13:56                                                                                                           ` azurIt
  2013-02-08 14:47                                                                                                             ` Michal Hocko
  2013-02-08 15:24                                                                                                             ` Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: azurIt @ 2013-02-08 13:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>kernel log would be sufficient.


Full kernel log from kernel with you newest patch:
http://watchdog.sk/lkml/kern2.log



>This limit is for top level groups, right? Those seem to children which
>have 62MB charged - is that a limit for those children?


It was the limit for the parent cgroup and the processes were in one (the same) child cgroup. The child cgroup has no memory limit set (so the parent's limit - 330 MB - was also the limit for the child).



>Which are those two processes?


Data are inside memcg-bug-5.tar.gz in directories bug/<timestamp>/<pids>/


>I have no idea what is the strace role here.


I was stracing exactly two processes from that cgroup and exactly two processes were stuck later and it was impossible to kill them. Both of them were waiting in 'ptrace_stop'. Maybe it's completely unrelated, just guessing.


azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 13:56                                                                                                           ` azurIt
@ 2013-02-08 14:47                                                                                                             ` Michal Hocko
  2013-02-08 15:24                                                                                                             ` Michal Hocko
  1 sibling, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-08 14:47 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 08-02-13 14:56:16, azurIt wrote:
> Data are inside memcg-bug-5.tar.gz in directories bug/<timestamp>/<pids>/

ohh, I didn't realize those were timestamp directories. It makes more
sense now.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 13:56                                                                                                           ` azurIt
  2013-02-08 14:47                                                                                                             ` Michal Hocko
@ 2013-02-08 15:24                                                                                                             ` Michal Hocko
  2013-02-08 15:58                                                                                                               ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-08 15:24 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 08-02-13 14:56:16, azurIt wrote:
> >kernel log would be sufficient.
> 
> 
> Full kernel log from kernel with you newest patch:
> http://watchdog.sk/lkml/kern2.log

OK, so the log says that there is a little slaughter on your yard:
$ grep "Memory cgroup out of memory:" kern2.log | wc -l
220

$ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l
220

Which means that the oom killer didn't try to kill any task more than
once, which is good because it tells us that the killed task manages to
die before we trigger oom again. So this is definitely not a deadlock.
You are just hitting OOM very often.
$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1091/uid killed as a result of limit of /1091
      1 Task in /1223/uid killed as a result of limit of /1223
      1 Task in /1229/uid killed as a result of limit of /1229
      1 Task in /1255/uid killed as a result of limit of /1255
      1 Task in /1424/uid killed as a result of limit of /1424
      1 Task in /1470/uid killed as a result of limit of /1470
      1 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1080/uid killed as a result of limit of /1080
      3 Task in /1381/uid killed as a result of limit of /1381
      4 Task in /1185/uid killed as a result of limit of /1185
      4 Task in /1289/uid killed as a result of limit of /1289
      4 Task in /1709/uid killed as a result of limit of /1709
      5 Task in /1279/uid killed as a result of limit of /1279
      6 Task in /1020/uid killed as a result of limit of /1020
      6 Task in /1527/uid killed as a result of limit of /1527
      9 Task in /1388/uid killed as a result of limit of /1388
     17 Task in /1281/uid killed as a result of limit of /1281
     22 Task in /1599/uid killed as a result of limit of /1599
     30 Task in /1155/uid killed as a result of limit of /1155
     31 Task in /1258/uid killed as a result of limit of /1258
     71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most. I would check how much memory the
workload in the group really needs because this level of OOM cannot
possibly be healthy.

The log also says that the deadlock prevention implemented by the patch
triggered and some writes really failed due to potential OOM:
$ grep "If this message shows up" kern2.log 
Feb  8 01:17:10 server01 kernel: [  431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.

This doesn't look very unhealthy. I had expected that writes would fail
more often, but it seems that the biggest memory pressure comes from
mmaps and page faults, which have no way out other than OOM.

So my suggestion would be to reconsider the limits for the groups to
provide a more realistic environment.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 15:24                                                                                                             ` Michal Hocko
@ 2013-02-08 15:58                                                                                                               ` azurIt
  2013-02-08 17:10                                                                                                                 ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-02-08 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Which means that the oom killer didn't try to kill any task more than
>once, which is good because it tells us that the killed task manages to
>die before we trigger oom again. So this is definitely not a deadlock.
>You are just hitting OOM very often.
>$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
>      1 Task in /1091/uid killed as a result of limit of /1091
>      1 Task in /1223/uid killed as a result of limit of /1223
>      1 Task in /1229/uid killed as a result of limit of /1229
>      1 Task in /1255/uid killed as a result of limit of /1255
>      1 Task in /1424/uid killed as a result of limit of /1424
>      1 Task in /1470/uid killed as a result of limit of /1470
>      1 Task in /1567/uid killed as a result of limit of /1567
>      2 Task in /1080/uid killed as a result of limit of /1080
>      3 Task in /1381/uid killed as a result of limit of /1381
>      4 Task in /1185/uid killed as a result of limit of /1185
>      4 Task in /1289/uid killed as a result of limit of /1289
>      4 Task in /1709/uid killed as a result of limit of /1709
>      5 Task in /1279/uid killed as a result of limit of /1279
>      6 Task in /1020/uid killed as a result of limit of /1020
>      6 Task in /1527/uid killed as a result of limit of /1527
>      9 Task in /1388/uid killed as a result of limit of /1388
>     17 Task in /1281/uid killed as a result of limit of /1281
>     22 Task in /1599/uid killed as a result of limit of /1599
>     30 Task in /1155/uid killed as a result of limit of /1155
>     31 Task in /1258/uid killed as a result of limit of /1258
>     71 Task in /1293/uid killed as a result of limit of /1293
>
>So the group 1293 suffers the most. I would check how much memory the
>workload in the group really needs because this level of OOM cannot
>possibly be healthy.



I took the kernel log from yesterday from the same time frame:

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1252/uid killed as a result of limit of /1252
      1 Task in /1709/uid killed as a result of limit of /1709
      2 Task in /1185/uid killed as a result of limit of /1185
      2 Task in /1388/uid killed as a result of limit of /1388
      2 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1650/uid killed as a result of limit of /1650
      3 Task in /1527/uid killed as a result of limit of /1527
      5 Task in /1552/uid killed as a result of limit of /1552
   1634 Task in /1258/uid killed as a result of limit of /1258

As you can see, there were many more OOMs in '1258' and no such problems as this night (well, there were never such problems before :) ). As I said, cgroup 1258 was freezing every few minutes with your latest patch, so there must be something wrong (it usually freezes about once per day). And it was really frozen (I checked that); the symptoms were:
 - cannot strace any of cgroup processes
 - no new processes were started, still the same processes were 'running'
 - kernel was unable to resolve this by it's own
 - all processes together were taking 100% CPU
 - the whole memory limit was used
(see memcg-bug-4.tar.gz for more info)
Unfortunately I forgot to check whether killing only a few of the processes would resolve it (I always killed them all yesterday night). I don't know if it was in a deadlock or not, but the kernel was definitely unable to resolve the problem. And there is still the mystery of the two frozen processes which cannot be killed.

By the way, I KNOW that so much OOM is not healthy, but the client simply doesn't want to buy more memory. He knows about the problem of the insufficient memory limit.

Thank you.


azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-08  1:40                                                                                                 ` Kamezawa Hiroyuki
@ 2013-02-08 16:01                                                                                                   ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-08 16:01 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner

On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote:
> (2013/02/07 20:01), Kamezawa Hiroyuki wrote:
[...]
> >Hmm. do we need to increase the "limit" virtually at memcg oom until
> >the oom-killed process dies ?
> 
> Here is my naive idea...

and the next step would be
http://en.wikipedia.org/wiki/Credit_default_swap :P

But seriously now, the idea is not bad at all. This implementation
would need some tweaks to work, though (e.g. you would need to wake oom
sleepers when you get a loan - because those are the ones which can
block the resource). We should also give the borrowed charges only to
those who would oom, to prevent stealing.
I think that it should be mem_cgroup_out_of_memory which establishes the
loan, and it can have a look at how much memory the killed task frees -
e.g. some portion of get_mm_rss(), or a more precise but much more
expensive traversal of private vmas, checking whether they charged
memory from the target memcg hierarchy (this is a slow path anyway).

But who knows maybe a fixed 2MB would work out as well.

Thanks!

> ==
> From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 8 Feb 2013 10:43:52 +0900
> Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation.
> 
> When an OOM happens, a task is killed and resources will be freed.
> 
> A problem here is that a task which is oom-killed may wait for
> some other resource for which memory is required. A thread that
> waits for free memory may hold some mutex while the oom-killed
> process waits for that mutex.
> 
> To avoid this, relaxing charged memory by giving virtual resource
> can be a help. The system can get back it at uncharge().
> This is a sample naive implementation.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 73 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 25ac5f4..4dea49a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -301,6 +301,9 @@ struct mem_cgroup {
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> +	/* extra resource at emergency situation */
> +	unsigned long	loan;
> +	spinlock_t	loan_lock;
>  	/* protect arrays of thresholds */
>  	struct mutex thresholds_lock;
> @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  	mem_cgroup_iter_break(root_memcg, victim);
>  	return total;
>  }
> +/*
> + * When a memcg is in OOM situation, this lack of resource may cause deadlock
> + * because of complicated lock dependency(i_mutex...). To avoid that, we
> + * need extra resource or avoid charging.
> + *
> + * A memcg can request resource in an emergency state. We call it as loan.
> + * A memcg will return a loan when it does uncharge resource. We disallow
> + * double-loan and moving task to other groups until the loan is fully
> + * returned.
> + *
> + * Note: the problem here is that we cannot know what amount of resource
> + * should be necessary to exit an emergency state.....
> + */
> +#define LOAN_MAX		(2 * 1024 * 1024)
> +
> +static void mem_cgroup_make_loan(struct mem_cgroup *memcg)
> +{
> +	u64 usage;
> +	unsigned long amount;
> +
> +	amount = LOAN_MAX;
> +
> +	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +	if (amount > usage /2 )
> +		amount = usage / 2;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		spin_unlock(&memcg->loan_lock);
> +		return;
> +	}
> +	memcg->loan = amount;
> +	res_counter_uncharge(&memcg->res, amount);
> +	if (do_swap_account)
> +		res_counter_uncharge(&memcg->memsw, amount);
> +	spin_unlock(&memcg->loan_lock);
> +}
> +
> +/* return amount of free resource which can be uncharged */
> +static unsigned long
> +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val)
> +{
> +	unsigned long tmp;
> +	/* we don't care small race here */
> +	if (unlikely(!memcg->loan))
> +		return val;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		tmp = min(memcg->loan, val);
> +		memcg->loan -= tmp;
> +		val -= tmp;
> +	}
> +	spin_unlock(&memcg->loan_lock);
> +	return val;
> +}
> +
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
>  	if (need_to_kill) {
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(memcg, mask, order);
> +		mem_cgroup_make_loan(memcg);
>  	} else {
>  		schedule();
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>  	if (!mem_cgroup_is_root(memcg)) {
>  		unsigned long bytes = nr_pages * PAGE_SIZE;
> +		bytes = mem_cgroup_may_return_loan(memcg, bytes);
> +
>  		res_counter_uncharge(&memcg->res, bytes);
>  		if (do_swap_account)
>  			res_counter_uncharge(&memcg->memsw, bytes);
> @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
>  {
>  	struct memcg_batch_info *batch = NULL;
>  	bool uncharge_memsw = true;
> +	unsigned long val;
>  	/* If swapout, usage of swap doesn't decrease */
>  	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
>  		batch->memsw_nr_pages++;
>  	return;
>  direct_uncharge:
> -	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
> +	val = nr_pages * PAGE_SIZE;
> +	val = mem_cgroup_may_return_loan(memcg, val);
> +	res_counter_uncharge(&memcg->res, val);
>  	if (uncharge_memsw)
> -		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
> +		res_counter_uncharge(&memcg->memsw, val);
>  	if (unlikely(batch->memcg != memcg))
>  		memcg_oom_recover(memcg);
>  }
> @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void)
>  void mem_cgroup_uncharge_end(void)
>  {
>  	struct memcg_batch_info *batch = &current->memcg_batch;
> +	unsigned long val;
>  	if (!batch->do_batch)
>  		return;
> @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void)
>  	if (!batch->memcg)
>  		return;
> +	val = batch->nr_pages * PAGE_SIZE;
> +	val = mem_cgroup_may_return_loan(batch->memcg, val);
>  	/*
>  	 * This "batch->memcg" is valid without any css_get/put etc...
>  	 * bacause we hide charges behind us.
>  	 */
>  	if (batch->nr_pages)
> -		res_counter_uncharge(&batch->memcg->res,
> -				     batch->nr_pages * PAGE_SIZE);
> +		res_counter_uncharge(&batch->memcg->res, val);
>  	if (batch->memsw_nr_pages)
> -		res_counter_uncharge(&batch->memcg->memsw,
> -				     batch->memsw_nr_pages * PAGE_SIZE);
> +		res_counter_uncharge(&batch->memcg->memsw, val);
>  	memcg_oom_recover(batch->memcg);
>  	/* forget this pointer (for sanity check) */
>  	batch->memcg = NULL;
> @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>  	memcg->move_charge_at_immigrate = 0;
>  	mutex_init(&memcg->thresholds_lock);
>  	spin_lock_init(&memcg->move_lock);
> +	memcg->loan = 0;
> +	spin_lock_init(&memcg->loan_lock);
>  	return &memcg->css;
> -- 
> 1.7.10.2
> 
> 
> 
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-08  4:27                                                                                                   ` Greg Thelen
@ 2013-02-08 16:29                                                                                                     ` Michal Hocko
  2013-02-08 16:40                                                                                                       ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-08 16:29 UTC (permalink / raw)
  To: Greg Thelen
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Thu 07-02-13 20:27:00, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 10:09:57, Greg Thelen wrote:
> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> 
> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> >> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> >> 
> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
> >> >> > [...]
> >> >> >> Just to be sure - am i supposed to apply this two patches?
> >> >> >> http://watchdog.sk/lkml/patches/
> >> >> >
> >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I
> >> >> > mentioned in a follow up email. Here is the full patch:
> >> >> > ---
> >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> >> >> > From: Michal Hocko <mhocko@suse.cz>
> >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >> >> >
> >> >> > memcg oom killer might deadlock if the process which falls down to
> >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
> >> >> > terminate because it is blocked on the very same lock.
> >> >> > This can happen when a write system call needs to allocate a page but
> >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
> >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> >> >> > have been reclaimed already) and the process selected by memcg OOM
> >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >> >> >
> >> >> > Process A
> >> >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> >> >> > [<ffffffff81121c90>] do_last+0x250/0xa30
> >> >> > [<ffffffff81122547>] path_openat+0xd7/0x440
> >> >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> >> >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> >> >> > [<ffffffff8110f950>] sys_open+0x20/0x30
> >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >> >
> >> >> > Process B
> >> >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >> >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >> >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> >> >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> >> >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> >> >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> >> >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> >> >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> >> >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> >> >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> >> >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130
> >> >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> >> >> > [<ffffffff81112381>] sys_write+0x51/0x90
> >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> >> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >> 
> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> >> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> >> >> think that this deadlock is also possible in the page allocator even
> >> >> before getting to add_to_page_cache_lru.  no?
> >> >
> >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> >> > and it shouldn't be called from the pageout path so __page_cache_alloc
> >> > should be safe.
> >> 
> >> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
> >> My concern is that __page_cache_alloc() will invoke the oom killer and
> >> select a victim which wants i_mutex.  This victim will deadlock because
> >> the oom killer caller already holds i_mutex.  
> >
> > That would be true for the memcg oom because that one is blocking but
> > the global oom just puts the allocator into sleep for a while and then
> > the allocator should back off eventually (unless this is NOFAIL
> > allocation). I would need to look closer whether this is really the case
> > - I haven't seen that allocator code path for a while...
> 
> I think the page allocator can loop forever waiting for an oom victim to
> terminate even without NOFAIL.  Especially if the oom victim wants a
> resource exclusively held by the allocating thread (e.g. i_mutex).  It
> looks like the same deadlock you describe is also possible (though more
> rare) without memcg.

OK, I have checked the allocator slow path and you are right, even
GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. an
OOM-killed task blocked on down_write(mmap_sem) while the page fault
handler holds mmap_sem for reading and allocates a new page without
making any progress.
Luckily there are memory reserves the allocator eventually falls back
to, so the allocation should be able to get some memory and release the
lock. There is still a theoretical chance this would block, but it
sounds like a corner case, so I wouldn't worry about it very much.

> If the looping thread is an eligible oom victim (i.e. not oom disabled,
> not an kernel thread, etc) then the page allocator can return NULL in so
> long as NOFAIL is not used.  So any allocator which is able to call the
> oom killer and is not oom disabled (kernel thread, etc) is already
> exposed to the possibility of page allocator failure.  So if the page
> allocator could detect the deadlock, then it could safely return NULL.
> Maybe after looping N times without forward progress the page allocator
> should consider failing unless NOFAIL is given.

The page allocator is quite tricky to touch, and the chances of this
deadlock are not that big.

> if memcg oom kill has been tried a reasonable number of times.  Simply
> failing the memcg charge with ENOMEM seems easier to support than
> exceeding limit (Kame's loan patch).

We cannot do that in the page fault path because this would lead to a
global oom killer. We would need to either retry the page fault or send
KILL to the faulting process. But I do not like this much as this could
lead to DoS attacks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
  2013-02-08 16:29                                                                                                     ` Michal Hocko
@ 2013-02-08 16:40                                                                                                       ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-08 16:40 UTC (permalink / raw)
  To: Greg Thelen
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 17:29:18, Michal Hocko wrote:
[...]
> OK, I have checked the allocator slow path and you are right even
> GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g.
> OOM killed task blocked on down_write(mmap_sem) while the page fault
> handler holding mmap_sem for reading and allocating a new page without
> any progress.

And now that I think about it some more, it sounds like it shouldn't be
possible, because the allocator would fail once it sees TIF_MEMDIE (the
OOM killer kills all threads that share the same mm). Maybe there are
other locks that are dangerous, but I think the risk is pretty low.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 15:58                                                                                                               ` azurIt
@ 2013-02-08 17:10                                                                                                                 ` Michal Hocko
  2013-02-08 21:02                                                                                                                   ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-08 17:10 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 08-02-13 16:58:05, azurIt wrote:
[...]
> I took the kernel log from yesterday from the same time frame:
> 
> $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
>       1 Task in /1252/uid killed as a result of limit of /1252
>       1 Task in /1709/uid killed as a result of limit of /1709
>       2 Task in /1185/uid killed as a result of limit of /1185
>       2 Task in /1388/uid killed as a result of limit of /1388
>       2 Task in /1567/uid killed as a result of limit of /1567
>       2 Task in /1650/uid killed as a result of limit of /1650
>       3 Task in /1527/uid killed as a result of limit of /1527
>       5 Task in /1552/uid killed as a result of limit of /1552
>    1634 Task in /1258/uid killed as a result of limit of /1258
> 
> As you can see, there were much more OOM in '1258' and no such
> problems like this night (well, there were never such problems before
> :) ).

Well, all the patch does is prevent the deadlock we have seen earlier.
Previously the writer would block on the oom wait queue; now it fails
with ENOMEM instead. The caller sees this as a short write which can be
retried (it is a question whether userspace can cope with that
properly). All other OOMs are preserved.

I suspect that all the problems you are seeing now are just side effects
of the OOM conditions.

> As i said, cgroup 1258 were freezing every few minutes with your
> latest patch so there must be something wrong (it usually freezes
> about once per day). And it was really freezed (i checked that), the
> sypthoms were:

I assume you have checked that the killed processes eventually die,
right?

>  - cannot strace any of cgroup processes
>  - no new processes were started, still the same processes were 'running'
>  - kernel was unable to resolve this by it's own
>  - all processes togather were taking 100% CPU
>  - the whole memory limit was used
> (see memcg-bug-4.tar.gz for more info)

Well, I do not see anything suspicious during that time period (the
timestamps translate to between Fri Feb  8 02:34:05 and Fri Feb  8
02:36:48). The kernel log shows a lot of oom during that time. All
killed processes die eventually.

> Unfortunately i forget to check if killing only few of the processes
> will resolve it (i always killed them all yesterday night). Don't
> know if is was in deadlock or not but kernel was definitely unable
> to resolve the problem.

Nothing shows it would be a deadlock so far. It is quite possible that
userspace went mad when it saw a lot of processes dying, because it
doesn't expect that.

> And there is still a mystery of two freezed processes which cannot be
> killed.
> 
> By the way, i KNOW that so much OOM is not healthy but the client
> simply don't want to buy more memory. He knows about the problem of
> unsufficient memory limit.

Well, then you would see a permanent flood of OOM killing, I am afraid.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 17:10                                                                                                                 ` Michal Hocko
@ 2013-02-08 21:02                                                                                                                   ` azurIt
  2013-02-10 15:03                                                                                                                     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-02-08 21:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>
>I assume you have checked that the killed processes eventually die,
>right?
>


When I killed them by hand, yes, they disappeared from the process list (I saw it). I don't know if they really died when the OOM killer killed them.


>Well, I do not see anything supsicious during that time period
>(timestamps translate between Fri Feb  8 02:34:05 and Fri Feb  8
>02:36:48). The kernel log shows a lot of oom during that time. All
>killed processes die eventually.


No, they didn't die from OOM when the cgroup was frozen. Just check the PIDs from memcg-bug-4.tar.gz and try to find them in the kernel log. Why are all the PIDs waiting on 'mem_cgroup_handle_oom' while there is no OOM message in the log? The data in memcg-bug-4.tar.gz cover only 2 minutes, but I let it run for about 15-20 minutes and not a single process was killed by OOM. I'm 100% sure that OOM was not killing them (maybe it was trying to but it didn't happen).


>
>Nothing shows it would be a deadlock so far. It is well possible that
>the userspace went mad when seeing a lot of processes dying because it
>doesn't expect it.
>


Lots of processes are dying now too, without your latest patch, and no such things are happening. I'm sure there is something more to this; maybe it revealed another bug?


azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 21:02                                                                                                                   ` azurIt
@ 2013-02-10 15:03                                                                                                                     ` Michal Hocko
  2013-02-10 16:46                                                                                                                       ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-10 15:03 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 08-02-13 22:02:43, azurIt wrote:
> >
> >I assume you have checked that the killed processes eventually die,
> >right?
> 
> 
> When I killed them by hand, yes, they disappeared from the process list (I
> saw it). I don't know if they really died when the OOM killer killed them.
> 
> 
> >Well, I do not see anything suspicious during that time period
> >(timestamps translate between Fri Feb  8 02:34:05 and Fri Feb  8
> >02:36:48). The kernel log shows a lot of oom during that time. All
> >killed processes die eventually.
> 
> 
> No, they didn't die from OOM when the cgroup was frozen. Just check the PIDs
> from memcg-bug-4.tar.gz and try to find them in the kernel log.

OK, you seem to be right. My initial examination showed that each cgroup
under OOM was able to move forward - in other words, it was able to send
SIGKILL to somebody and we didn't loop on a single task which could not die
for some reason. Now, looking closer, it seems we really have 2 tasks
which didn't die after being killed by the OOM killer:

$ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; 
do 
	find bug -name $i; 
done | sed 's@.*/@@' | sort | uniq -c
    141 18211
    141 8102

$ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c
    141 3b8ce17e82a065a24ee046112033e1e8
So all the stacks are the same:
[<ffffffff81069f94>] ptrace_stop+0x114/0x290
[<ffffffff8106a198>] ptrace_do_notify+0x88/0xa0
[<ffffffff8106a203>] ptrace_notify+0x53/0x70
[<ffffffff8100d168>] syscall_trace_enter+0xf8/0x1c0
[<ffffffff815b6983>] tracesys+0x71/0xd7
[<ffffffffffffffff>] 0xffffffffffffffff

stuck in the ptrace code.

The other task is more interesting:
$ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c
    135 042e893c0e6657ed321ea9045e528f3e
      6 dc7e71ce73be2a5c73404b565926e709

All snapshots with 042e893c0e6657ed321ea9045e528f3e are in:
[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110ba83>] T.1149+0x5f3/0x600
[<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0
[<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810ee2a9>] handle_pte_fault+0x609/0x940
[<ffffffff810ee718>] handle_mm_fault+0x138/0x260
[<ffffffff810270bd>] do_page_fault+0x13d/0x460
[<ffffffff815b633f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

While the others do not show any stack:
cat 1360287257/8102/stack 
[<ffffffffffffffff>] 0xffffffffffffffff

Which is quite interesting because we are talking about snapshots
starting at 1360287245 (which maps to 02:34:05) but the kern2.log tells
us that this process has been killed much earlier at:

Feb  8 01:18:30 server01 kernel: [  511.139921] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb  8 01:18:30 server01 kernel: [  511.229755] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Feb  8 01:18:30 server01 kernel: [  511.230146] [ 8102]  1293  8102   170258    65869   7       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.230339] [ 8113]  1293  8113   163756    59442   5       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.230528] [ 8116]  1293  8116   170094    65675   2       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.230726] [ 8119]  1293  8119   170094    65675   6       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.230924] [ 8123]  1293  8123   169070    64612   7       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.231132] [ 8124]  1293  8124   170094    65675   5       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.231321] [ 8125]  1293  8125   170094    65673   1       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child

This would suggest that the task is hung and cannot be killed, but if we
have a look at the following OOM in the same group 1293, the task was _not_
present in the process list for that group:

Feb  8 01:18:33 server01 kernel: [  514.789550] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb  8 01:18:33 server01 kernel: [  514.893198] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Feb  8 01:18:33 server01 kernel: [  514.893594] [ 8113]  1293  8113   168212    64036   1       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.893786] [ 8116]  1293  8116   170258    65870   6       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.893976] [ 8119]  1293  8119   170258    65870   7       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.894166] [ 8123]  1293  8123   170158    65824   6       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.894356] [ 8124]  1293  8124   170258    65870   5       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.894547] [ 8125]  1293  8125   170158    65824   1       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.894749] [ 8149]  1293  8149   163989    59647   7       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child

This is all _before_ you started collecting stacks and it also says that
8102 is gone.

This all suggests that a) the stack unwinder which displays
/proc/<pid>/stack is somehow confused and doesn't show the correct
stack for this process, and b) the two processes cannot terminate due to
some issue related to ptrace (stracing) the dying process.

The above oom list doesn't include any processes which have already released
their memory, which would explain why you can still see them as members of
the group (when looking into the cgroup/tasks file). My guess would be that
there is a bug in ptrace which doesn't drop a reference to the task,
so it cannot go away although it has released all its resources
already.

> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> OOM message in the log?

I am not sure what you mean here but there are
$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
16

OOM killer events during the time you were gathering memcg-bug-4 data.

>  The data in memcg-bug-4.tar.gz cover only 2
> minutes, but I let it run for about 15-20 minutes and not a single
> process was killed by OOM.

I can see
$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
57

killed after 02:38:47 when you stopped gathering data for memcg-bug-4

> I'm 100% sure that OOM was not killing them (maybe it was trying to
> but it didn't happen).

OK, let's do a little exercise. The processes eligible for OOM
are listed before any task is killed. So if we collect both the pid lists
and the "Kill process" messages per pid, then no entries for a given pid
should appear in the pid lists after that pid has been killed.

$ mkdir out
$ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`
do 
	grep -e "Memory cgroup out of memory: Kill process $i" \
	     -e "\[ *\<$i\]" kern2.log > out/$i
done
$ for i in out/*
do 
	tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null|| echo "$i has already killed tasks"
done
out/6698 has already killed tasks
out/6703 has already killed tasks

OK, so there are two pids which were listed after they had been
killed. Let's have a look at them.
$ cat out/6698
Feb  8 01:17:04 server01 kernel: [  425.497924] [ 6698]  1293  6698   170258    65846   1       0             0 apache2
Feb  8 01:17:05 server01 kernel: [  426.079010] [ 6698]  1293  6698   170258    65846   1       0             0 apache2
Feb  8 01:17:10 server01 kernel: [  431.144460] [ 6698]  1293  6698   169358    65220   1       0             0 apache2
Feb  8 01:17:10 server01 kernel: [  431.146058] Memory cgroup out of memory: Kill process 6698 (apache2) score 1000 or sacrifice child
Feb  8 03:27:57 server01 kernel: [ 8278.439896] [ 6698]  1020  6698   168518    64219   0       0             0 apache2
Feb  8 03:27:57 server01 kernel: [ 8278.879439] [ 6698]  1020  6698   168518    64218   6       0             0 apache2
Feb  8 03:27:59 server01 kernel: [ 8280.023944] [ 6698]  1020  6698   168816    64540   7       0             0 apache2
Feb  8 03:28:02 server01 kernel: [ 8283.242282] [ 6698]  1020  6698   171953    67751   6       0             0 apache2
$ cat out/6703
Feb  8 01:17:04 server01 kernel: [  425.498118] [ 6703]  1293  6703   170258    65844   6       0             0 apache2
Feb  8 01:17:05 server01 kernel: [  426.079206] [ 6703]  1293  6703   170258    65844   6       0             0 apache2
Feb  8 01:17:10 server01 kernel: [  431.144653] [ 6703]  1293  6703   169358    65219   2       0             0 apache2
Feb  8 01:17:10 server01 kernel: [  431.258924] [ 6703]  1293  6703   169358    65219   5       0             0 apache2
Feb  8 01:17:10 server01 kernel: [  431.260282] Memory cgroup out of memory: Kill process 6703 (apache2) score 1000 or sacrifice child
Feb  8 03:27:57 server01 kernel: [ 8278.440043] [ 6703]  1020  6703   166286    61978   7       0             0 apache2
Feb  8 03:27:57 server01 kernel: [ 8278.879587] [ 6703]  1020  6703   166286    61977   7       0             0 apache2
Feb  8 03:27:59 server01 kernel: [ 8280.024091] [ 6703]  1020  6703   166484    62233   7       0             0 apache2
Feb  8 03:28:02 server01 kernel: [ 8283.242429] [ 6703]  1020  6703   167402    63118   0       0             0 apache2

Lists have the following columns:
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name

As we can see, the uid changed for both pids after they were killed
(from 1293 to 1020), which suggests that each pid was later reused
for a different user (a clear sign that those pids died) - and thus a
different group in your setup.
So those two died as well, apparently.
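As an aside, the uid comparison above can be wrapped into a small helper - a sketch only: it assumes the task-dump line format quoted above, and the function name is made up for illustration:

```shell
# pid_reuse LOGFILE -- flag pids whose uid changes between OOM task-dump
# lines, i.e. pids that were reused for a different user after being killed.
# Assumes the dump format quoted above: "... [ 6698]  1293  6698 ..."
pid_reuse() {
    # Extract "pid uid" pairs from the task-dump lines; the greedy .* makes
    # sed anchor on the last "[ <digits>]" group, i.e. the pid column, not
    # the kernel timestamp (which contains a dot and so never matches).
    sed -n 's/.*\[ *\([0-9][0-9]*\)\]  *\([0-9][0-9]*\).*/\1 \2/p' "$1" |
    awk '{
        pid = $1; uid = $2
        if (pid in seen && seen[pid] != uid)
            printf "pid %s reused: uid %s -> %s\n", pid, seen[pid], uid
        seen[pid] = uid
    }'
}
```

Running it over kern2.log should report 6698 and 6703 switching from uid 1293 to 1020, matching the manual check.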

> >Nothing shows it would be a deadlock so far. It is well possible that
> >the userspace went mad when seeing a lot of processes dying because it
> >doesn't expect it.
> 
> Lots of processes are dying now too, without your latest patch, and
> no such things are happening. I'm sure there is something more to
> this; maybe it revealed another bug?

So far nothing shows that there would be anything broken wrt. the memcg OOM
killer. The ptrace issue sounds strange, all right, but that is another
story and worth a separate investigation. I would be interested whether
you still see anything going wrong without strace in the game.

You can get pretty nice overview of what is going on wrt. OOM from the
log.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-10 15:03                                                                                                                     ` Michal Hocko
@ 2013-02-10 16:46                                                                                                                       ` azurIt
  2013-02-11 11:22                                                                                                                         ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-02-10 16:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>stuck in the ptrace code.


But this happened _after_ the cgroup was frozen and I tried to strace one of its processes (to see what's happening):

Feb  8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0



>> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
>> OOM message in the log?
>
>I am not sure what you mean here but there are
>$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
>16
>
>OOM killer events during the time you were gathering memcg-bug-4 data.
>
>>  The data in memcg-bug-4.tar.gz cover only 2
>> minutes, but I let it run for about 15-20 minutes and not a single
>> process was killed by OOM.
>
>I can see
>$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
>57
>
>killed after 02:38:47 when you stopped gathering data for memcg-bug-4


I meant that not a single process was killed inside cgroup 1258 (the data from this cgroup are in memcg-bug-4.tar.gz).

Just look at the data from memcg-bug-4.tar.gz, which were taken from cgroup 1258. Almost all processes are in 'mem_cgroup_handle_oom', so the cgroup is under OOM. I assume this is supposed to take only a few seconds while the kernel finds a process and kills it (and maybe does it again until enough memory is freed). I was gathering the data for about two and a half minutes and NOT A SINGLE process was killed (just compare the lists of PIDs from the first and the last directory inside memcg-bug-4.tar.gz). Even more, not a single process was killed in cgroup 1258 even after I stopped gathering the data. You can also take the list of PIDs from memcg-bug-4.tar.gz and you will find only 18211 and 8102 (which are the two stuck processes).

So my question is: why was no process killed inside cgroup 1258 while it was under OOM? It was under OOM for at least two and a half minutes while I was gathering the data (then I let it run for roughly 10 more minutes and killed the processes by hand, but I cannot prove this). Why didn't the kernel kill any process for so long and end the OOM?

By the way, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping between these two stacks (I pasted only the first line of each stack):
mem_cgroup_handle_oom+0x241/0x3b0
0xffffffffffffffff

Some of them are in 'poll_schedule_timeout' and then they start to loop as above. Is this correct behavior?

For example, do (first line of stack from process 7710 from all timestamps):
for i in */7710/stack; do head -n1 $i; done
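The same per-pid check, run over every pid in the snapshot tree at once and collapsed with uniq -c, could be sketched like this (the <timestamp>/<pid>/stack layout is the one used in memcg-bug-4.tar.gz; the helper name is made up):

```shell
# stack_summary SNAPDIR -- for each pid found under SNAPDIR/<timestamp>/<pid>/stack,
# print the run lengths of identical top stack frames across consecutive
# snapshots, so transitions like poll_schedule_timeout -> mem_cgroup_handle_oom
# show up at a glance.
stack_summary() {
    snapdir=$1
    # Collect the unique numeric pid directory names across all timestamps.
    pids=$(for d in "$snapdir"/*/; do ls "$d"; done | grep -x '[0-9][0-9]*' | sort -un)
    for pid in $pids; do
        echo "pid $pid:"
        for f in "$snapdir"/*/"$pid"/stack; do
            [ -f "$f" ] && head -n1 "$f"
        done | uniq -c
    done
}
```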

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-10 16:46                                                                                                                       ` azurIt
@ 2013-02-11 11:22                                                                                                                         ` Michal Hocko
  2013-02-22  8:23                                                                                                                           ` azurIt
  2013-02-22 12:00                                                                                                                           ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-11 11:22 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Sun 10-02-13 17:46:19, azurIt wrote:
> >stuck in the ptrace code.
> 
> 
> But this happened _after_ the cgroup was frozen and I tried to strace
> one of its processes (to see what's happening):
> 
> Feb  8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

Hmmm,
Feb  8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child)

So the process was killed 10 minutes earlier, and this was really the
last OOM event for group /1258:

$ grep "Task in /1258/uid killed" kern2.log | tail -n2
Feb  8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258
Feb  8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258

But this was still before you started collecting data for memcg-bug-4
(2:34), so unfortunately we do not know what the previous stack was.

> >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> >> OOM message in the log?
> >
> >I am not sure what you mean here but there are
> >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
> >16
> >
> >OOM killer events during the time you were gathering memcg-bug-4 data.
> >
> >>  The data in memcg-bug-4.tar.gz cover only 2
> >> minutes, but I let it run for about 15-20 minutes and not a single
> >> process was killed by OOM.
> >
> >I can see
> >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
> >57
> >
> >killed after 02:38:47 when you stopped gathering data for memcg-bug-4
> 
> 
> I meant that not a single process was killed inside cgroup 1258 (the data
> from this cgroup are in memcg-bug-4.tar.gz).
>
> Just look at the data from memcg-bug-4.tar.gz, which were taken from cgroup
> 1258.

Are you sure about that? When I extracted all pids from the timestamp
directories and grepped them in the log I got this:
for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done
Feb  8 01:31:02 server01 kernel: [ 1263.429212] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:31:15 server01 kernel: [ 1276.655241] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:32:29 server01 kernel: [ 1350.797835] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:32:42 server01 kernel: [ 1363.662242] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:32:46 server01 kernel: [ 1367.181798] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:32:46 server01 kernel: [ 1367.381627] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:32:46 server01 kernel: [ 1367.490896] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:33:02 server01 kernel: [ 1383.709652] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:36:26 server01 kernel: [ 1587.458967] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:36:26 server01 kernel: [ 1587.558419] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:36:26 server01 kernel: [ 1587.652474] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:39:02 server01 kernel: [ 1743.107086] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:39:16 server01 kernel: [ 1757.015359] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:39:16 server01 kernel: [ 1757.133998] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:39:16 server01 kernel: [ 1757.262992] [18211]  1258 18211   164338    60950   0       0             0 apache2
Feb  8 01:18:12 server01 kernel: [  493.156641] [ 7888]  1293  7888   169326    64876   3       0             0 apache2
Feb  8 01:18:12 server01 kernel: [  493.269129] [ 7888]  1293  7888   169390    64876   4       0             0 apache2
Feb  8 01:18:21 server01 kernel: [  502.384221] [ 8011]  1293  8011   170094    65675   5       0             0 apache2
Feb  8 01:18:24 server01 kernel: [  505.052600] [ 8011]  1293  8011   170260    65854   2       0             0 apache2
Feb  8 01:18:24 server01 kernel: [  505.200454] [ 8011]  1293  8011   170260    65854   2       0             0 apache2
Feb  8 01:18:33 server01 kernel: [  514.538637] [ 8054]  1258  8054   164404    60618   1       0             0 apache2
Feb  8 01:18:30 server01 kernel: [  511.230146] [ 8102]  1293  8102   170258    65869   7       0             0 apache2

So at least 7888, 8011 and 8102 were from a different group (1293).
Others were never listed in the eligible processes list which is a bit
unexpected. It is also unfortunate because I cannot match them to their
groups from the log.
$ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done
7265 not listed
7474 not listed
7710 not listed
7969 not listed
7988 not listed
7997 not listed
8000 not listed
8014 not listed
8016 not listed
8019 not listed
8057 not listed
8058 not listed
8059 not listed
8063 not listed
8064 not listed
8066 not listed
8067 not listed
8069 not listed
8070 not listed
8071 not listed
8072 not listed
8075 not listed
8091 not listed
8092 not listed
8094 not listed
8098 not listed
8099 not listed
8100 not listed

Are you sure all of them belong to 1258 group?

> Almost all processes are in 'mem_cgroup_handle_oom' so cgroup
> is under OOM. 

You are right, almost all of them are waiting in mem_cgroup_handle_oom,
which suggests that they should be listed in a per-group eligible tasks
list.

One way this might happen is when a process which manages to
get the oom_lock has a fatal signal pending. Then we wouldn't get to
oom_kill_process and no OOM messages would get printed. This is correct
because such a task would terminate soon anyway and all the waiters
would wake up eventually. If not enough memory were freed, another
task would get the oom_lock and this one would trigger the OOM (unless it
had a fatal signal pending as well).

Another option would be that no task could be selected - e.g. because
select_bad_process sees a TIF_MEMDIE-marked task - one already killed
by the OOM killer but which was not able to terminate for some reason. 18211
could be such a task. But we do not know what was going on with it
before strace attached to it.

Finally it is possible that the OOM header (everything up to Kill process)
was suppressed because of rate limiting. But
$ grep -B1 "Kill process" kern2.log
Feb  8 01:15:02 server01 kernel: [  304.000402] [ 4969]  1258  4969   163761    59554   6       0             0 apache2
Feb  8 01:15:02 server01 kernel: [  304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child
--
Feb  8 01:15:51 server01 kernel: [  352.924573] [ 5847]  1709  5847   163433    58952   6       0             0 apache2
Feb  8 01:15:51 server01 kernel: [  352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child
[...]

says that the message was preceded by a process list so we can exclude
rate limiting.

> I assume that this is supposed to take only a few seconds
> while the kernel finds a process and kills it (and maybe does it again
> until enough memory is freed). I was gathering the data for
> about two and a half minutes and NOT A SINGLE process was killed (just
> compare the lists of PIDs from the first and the last directory inside
> memcg-bug-4.tar.gz). Even more, not a single process was killed in cgroup
> 1258 even after I stopped gathering the data. You can also take the
> list of PIDs from memcg-bug-4.tar.gz and you will find only 18211 and
> 8102 (which are the two stuck processes).
>
> So my question is: why was no process killed inside cgroup 1258
> while it was under OOM?

I would bet that there is something weird going on with pid:18211. But I
do not have enough information to find out what and why.

> It was under OOM for at least two and a half minutes while I was
> gathering the data (then I let it run for roughly 10 more minutes
> and killed the processes by hand, but I cannot prove this). Why didn't
> the kernel kill any process for so long and end the OOM?

As already mentioned above, select_bad_process doesn't select any task
if there is one which is on the way out. Maybe this is what is going on here.
 
> By the way, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping
> between these two stacks (I pasted only the first line of each stack):
> mem_cgroup_handle_oom+0x241/0x3b0
> 0xffffffffffffffff

0xffffffffffffffff is just a bogus entry. No idea why this happens.

> Some of them are in 'poll_schedule_timeout' and then they start to
> loop as above. Is this correct behavior?
> For example, do (first line of stack from process 7710 from all
> timestamps): for i in */7710/stack; do head -n1 $i; done

Yes, this is perfectly ok, because that task starts with:
$ cat bug/1360287245/7710/stack
[<ffffffff81125eb9>] poll_schedule_timeout+0x49/0x70
[<ffffffff8112675b>] do_sys_poll+0x54b/0x680
[<ffffffff81126b4c>] sys_poll+0x7c/0xf0
[<ffffffff815b6866>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

and then later on it gets into OOM because of a page fault:
$ cat bug/1360287250/7710/stack
[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110ba83>] T.1149+0x5f3/0x600
[<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0
[<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810eca1e>] do_wp_page+0x14e/0x800
[<ffffffff810edf04>] handle_pte_fault+0x264/0x940
[<ffffffff810ee718>] handle_mm_fault+0x138/0x260
[<ffffffff810270bd>] do_page_fault+0x13d/0x460
[<ffffffff815b633f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

And it loops in it until the end, which is possible as well if the group
is under a permanent OOM condition and the task is not selected to be
killed.

Unfortunately I am not able to reproduce this behavior even if I try
to hammer OOM like mad, so I am afraid I cannot help you much without
further debugging patches.
I do realize that experimenting in your environment is a problem but I
do not have many options left. Please do not use strace and rather collect
/proc/pid/stack instead. It would also be helpful to get the group's tasks
file to have a full list of tasks in the group.
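A minimal collection loop along those lines might look like this - a sketch only: the tasks-file path (e.g. /sys/fs/cgroup/memory/1258/tasks), the one-second interval and the helper name are all assumptions, and reading /proc/<pid>/stack requires root:

```shell
# collect_snapshots OUTDIR TASKSFILE SECONDS -- once per second for SECONDS
# seconds, copy the group's tasks file and each member's /proc/<pid>/stack
# into a per-timestamp directory (the layout used in memcg-bug-4.tar.gz).
collect_snapshots() {
    outdir=$1; tasks=$2; secs=$3
    end=$(( $(date +%s) + secs ))
    while [ "$(date +%s)" -lt "$end" ]; do
        now=$(date +%s)
        mkdir -p "$outdir/$now"
        cp "$tasks" "$outdir/$now/tasks"
        while read -r pid; do
            mkdir -p "$outdir/$now/$pid"
            # Without root this read fails and leaves an empty file.
            cat "/proc/$pid/stack" > "$outdir/$now/$pid/stack" 2>/dev/null
        done < "$tasks"
        sleep 1
    done
}
```

Started when the group freezes, this gives both the stack snapshots and the full task list per timestamp.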
---
>From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 11 Feb 2013 12:18:36 +0100
Subject: [PATCH] oom: debug skipping killing

---
 mm/oom_kill.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3d759f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
 			if (unlikely(frozen(p)))
 				thaw_process(p);
+			printk(KERN_WARNING"XXX: pid:%d (flags:%u) is TIF_MEMDIE. Waiting for it\n",
+					p->pid, p->flags);
 			return ERR_PTR(-1UL);
 		}
 		if (!p->mm)
@@ -353,8 +355,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 				 * then wait for it to finish before killing
 				 * some other task unnecessarily.
 				 */
-				if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
+				if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) {
+					printk(KERN_WARNING"XXX: pid:%d (flags:%u) is PF_EXITING. Waiting for it\n",
+							p->pid, p->flags);
 					return ERR_PTR(-1UL);
+				}
 			}
 		}
 
@@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
+		printk(KERN_WARNING"XXX: pid:%d (flags:%u). Not killing PF_EXITING\n", p->pid, p->flags);
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		return 0;
 	}
@@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 * its memory.
 	 */
 	if (fatal_signal_pending(current)) {
+		printk(KERN_WARNING"XXX: pid:%d (flags:%u) has fatal_signal_pending. Waiting for it\n",
+				p->pid, p->flags);
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-11 11:22                                                                                                                         ` Michal Hocko
@ 2013-02-22  8:23                                                                                                                           ` azurIt
  2013-02-22 12:52                                                                                                                             ` Michal Hocko
  2013-06-06 16:04                                                                                                                             ` Michal Hocko
  2013-02-22 12:00                                                                                                                           ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set azurIt
  1 sibling, 2 replies; 172+ messages in thread
From: azurIt @ 2013-02-22  8:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Unfortunately I am not able to reproduce this behavior even if I try
>to hammer OOM like mad, so I am afraid I cannot help you much without
>further debugging patches.
>I do realize that experimenting in your environment is a problem but I
>do not have many options left. Please do not use strace and rather collect
>/proc/pid/stack instead. It would also be helpful to get the group's tasks
>file to have a full list of tasks in the group.



Hi Michal,


sorry that I didn't respond for a while. Today I installed a kernel with your two patches and I'm running it now. I'm still having problems with OOM, which is not able to handle low memory and is not killing processes. Here is some info:

- data from cgroup 1258 while it was under OOM and no processes were killed (so the OOM didn't stop and the cgroup was frozen)
http://watchdog.sk/lkml/memcg-bug-6.tar.gz

I noticed the problem at about 8:39 and waited until 8:57 (nothing happened). Then I killed process 19864, which seemed to help; the other processes probably exited and the cgroup started to work. But the problem occurred again about 20 seconds later, so I killed all processes at 8:58. The problem has been occurring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.


- kernel log from boot until now
http://watchdog.sk/lkml/kern3.gz


By the way, something probably also happened at about 3:09, but I wasn't able to gather any data because my 'load check script' killed all apache processes (the load was more than 100).



azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-11 11:22                                                                                                                         ` Michal Hocko
  2013-02-22  8:23                                                                                                                           ` azurIt
@ 2013-02-22 12:00                                                                                                                           ` azurIt
  1 sibling, 0 replies; 172+ messages in thread
From: azurIt @ 2013-02-22 12:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Unfortunately I am not able to reproduce this behavior even if I try
>to hammer OOM like mad so I am afraid I cannot help you much without
>further debugging patches.
>I do realize that experimenting in your environment is a problem but I
>do not many options left. Please do not use strace and rather collect
>/proc/pid/stack instead. It would be also helpful to get group/tasks
>file to have a full list of tasks in the group



Sending new info!

I found out one interesting thing. When the problem occurs (it probably happens when the OOM killer is invoked in the target cgroup, but I'm not sure), the target cgroup somehow becomes broken. In other words, after the problem occurs once in the target cgroup, it keeps happening in that cgroup. I made this test:

1.) I created cgroup A with limits (including a memory limit).
2.) Waited until the OOM killer was triggered (this can take hours). Processes in the target cgroup became frozen, so they had to be killed.
3.) After this, processes keep freezing in cgroup A; it usually happens 20-30 seconds after the previously frozen processes are killed.
4.) I created cgroup B with the *same* limits as cgroup A and moved the user from A to B. The problem disappeared.
5.) Go to (2)
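Steps 1 and 4 of this test can be sketched against the cgroup v1 file interface; a minimal sketch, assuming the memory controller is mounted as a cgroupfs hierarchy (the group names are illustrative):

```python
import os

def create_cgroup_with_limit(cg_root, name, limit_bytes):
    """Create a memory cgroup directory and set its hard memory limit."""
    path = os.path.join(cg_root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    return path

def move_tasks(src_cgroup, dst_cgroup):
    """Migrate every task from src to dst via dst's tasks file."""
    with open(os.path.join(src_cgroup, "tasks")) as f:
        pids = [line.strip() for line in f if line.strip()]
    for pid in pids:
        # On real cgroupfs, each single-PID write migrates one task.
        with open(os.path.join(dst_cgroup, "tasks"), "w") as f:
            f.write(pid)
```

E.g. `move_tasks(create_cgroup_with_limit(root, "A", 256 << 20), create_cgroup_with_limit(root, "B", 256 << 20))` reproduces the A-to-B migration with identical limits.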

And a second thing: I got a kernel oops; look at the end of:
http://watchdog.sk/lkml/oops


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-22  8:23                                                                                                                           ` azurIt
@ 2013-02-22 12:52                                                                                                                             ` Michal Hocko
  2013-02-22 12:54                                                                                                                               ` azurIt
  2013-06-06 16:04                                                                                                                             ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-02-22 12:52 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Hi,

On Fri 22-02-13 09:23:32, azurIt wrote:
[...]
> sorry that i didn't response for a while. Today i installed kernel
> with your two patches and i'm running it now.

I am not sure how much time I'll have for this today but just to make
sure we are on the same page, could you point me to the two patches you
have applied in the mean time?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-22 12:52                                                                                                                             ` Michal Hocko
@ 2013-02-22 12:54                                                                                                                               ` azurIt
  2013-02-22 13:00                                                                                                                                 ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-02-22 12:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>I am not sure how much time I'll have for this today but just to make
>sure we are on the same page, could you point me to the two patches you
>have applied in the mean time?


Here:
http://watchdog.sk/lkml/patches2


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-22 12:54                                                                                                                               ` azurIt
@ 2013-02-22 13:00                                                                                                                                 ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2013-02-22 13:00 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Fri 22-02-13 13:54:42, azurIt wrote:
> >I am not sure how much time I'll have for this today but just to make
> >sure we are on the same page, could you point me to the two patches you
> >have applied in the mean time?
> 
> 
> Here:
> http://watchdog.sk/lkml/patches2

OK, looks correct.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-22  8:23                                                                                                                           ` azurIt
  2013-02-22 12:52                                                                                                                             ` Michal Hocko
@ 2013-06-06 16:04                                                                                                                             ` Michal Hocko
  2013-06-06 16:16                                                                                                                               ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-06-06 16:04 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Hi,

I am really sorry it took so long, but I was constantly preempted by
other stuff. I hope I have good news for you, though. Johannes has
found a nice way to overcome the deadlock issues from memcg OOM, which
might help you. Would you be willing to test with his patch
(http://permalink.gmane.org/gmane.linux.kernel.mm/101437)? Unlike my
patch, which handles just the i_mutex case, his patch covers all
possible locks.

I can backport the patch to your kernel (are you still using a 3.2
kernel, or have you moved to a newer one?).

On Fri 22-02-13 09:23:32, azurIt wrote:
> >Unfortunately I am not able to reproduce this behavior even if I try
> >to hammer OOM like mad so I am afraid I cannot help you much without
> >further debugging patches.
> >I do realize that experimenting in your environment is a problem but I
> >do not many options left. Please do not use strace and rather collect
> >/proc/pid/stack instead. It would be also helpful to get group/tasks
> >file to have a full list of tasks in the group
> 
> 
> 
> Hi Michal,
> 
> 
> sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info:
> 
> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed)
> http://watchdog.sk/lkml/memcg-bug-6.tar.gz
> 
> I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.
> 
> 
> - kernel log from boot until now
> http://watchdog.sk/lkml/kern3.gz
> 
> 
> Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100).
> 
> 
> 
> azur
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-06-06 16:04                                                                                                                             ` Michal Hocko
@ 2013-06-06 16:16                                                                                                                               ` azurIt
  2013-06-07 13:11                                                                                                                                 ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-06-06 16:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Hello Michal,

nice to hear from you! :) Yes, I'm still on 3.2. Could you be so kind as to try to backport it? Thank you very much!

azur



______________________________________________________________
> From: "Michal Hocko" <mhocko@suse.cz>
> To: azurIt <azurit@pobox.sk>
> Date: 06.06.2013 18:04
> Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org>
>Hi,
>
>I am really sorry it took so long but I was constantly preempted by
>other stuff. I hope I have a good news for you, though. Johannes has
>found a nice way how to overcome deadlock issues from memcg OOM which
>might help you. Would you be willing to test with his patch
>(http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my
>patch which handles just the i_mutex case his patch solved all possible
>locks.
>
>I can backport the patch for your kernel (are you still using 3.2 kernel
>or you have moved to a newer one?).
>
>On Fri 22-02-13 09:23:32, azurIt wrote:
>> >Unfortunately I am not able to reproduce this behavior even if I try
>> >to hammer OOM like mad so I am afraid I cannot help you much without
>> >further debugging patches.
>> >I do realize that experimenting in your environment is a problem but I
>> >do not many options left. Please do not use strace and rather collect
>> >/proc/pid/stack instead. It would be also helpful to get group/tasks
>> >file to have a full list of tasks in the group
>> 
>> 
>> 
>> Hi Michal,
>> 
>> 
>> sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info:
>> 
>> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed)
>> http://watchdog.sk/lkml/memcg-bug-6.tar.gz
>> 
>> I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.
>> 
>> 
>> - kernel log from boot until now
>> http://watchdog.sk/lkml/kern3.gz
>> 
>> 
>> Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100).
>> 
>> 
>> 
>> azur
>> --
>> To unsubscribe from this list: send the line "unsubscribe cgroups" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>-- 
>Michal Hocko
>SUSE Labs
>


* [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-06 16:16                                                                                                                               ` azurIt
@ 2013-06-07 13:11                                                                                                                                 ` Michal Hocko
  2013-06-17 10:21                                                                                                                                   ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-06-07 13:11 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Thu 06-06-13 18:16:33, azurIt wrote:
> Hello Michal,
> 
> nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and
> try to backport it? Thank you very much!

Here we go. I hope I didn't screw anything up (Johannes might
double-check), because there have been quite a few changes in this area
since 3.2. Nothing earth-shattering, though. Please note that I have
only compile-tested this. Also make sure you remove the previous patches
you have from me.
---
>From 9d2801c1f53147ca9134cc5f76ab28d505a37a54 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 7 Jun 2013 13:52:42 +0200
Subject: [PATCH] memcg: do not trap chargers with full callstack on OOM

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex.  The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0              # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations.  In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit.  But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/<pid>,
which tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting an OOM and
makes sure nobody loops or sleeps on OOM with locks held:

1. When OOMing in a system call (buffered IO and friends), invoke the
   OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue.
   Userspace should be able to handle this and it prevents anybody
   from looping or waiting with locks held.

2. When OOMing in a page fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

3. When detecting an OOM in a page fault but somebody else is handling
   it (either the kernel OOM killer or a userspace handler), don't go
   to sleep in the charge context.  Instead, remember the OOMing memcg
   in the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer.  Only uncharges and limit increases,
things that actually change the memory situation, should do wakeups.

Reported-by: azurIt <azurit@pobox.sk>
Debugged-by: Michal Hocko <mhocko@suse.cz>
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |   22 +++++++
 include/linux/mm.h         |    1 +
 include/linux/sched.h      |    6 ++
 mm/ksm.c                   |    2 +-
 mm/memcontrol.c            |  149 ++++++++++++++++++++++++++++----------------
 mm/memory.c                |   40 ++++++++----
 mm/oom_kill.c              |    2 +
 7 files changed, 156 insertions(+), 66 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..56bfc39 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+	p->memcg_oom.in_userfault = 1;
+}
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+	p->memcg_oom.in_userfault = 0;
+}
+bool mem_cgroup_oom_synchronize(void);
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -333,6 +342,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+}
+
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+	return false;
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_page_stat_item idx)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..91380ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_KERNEL	0x80	/* kernel-triggered fault (get_user_pages etc.) */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..d521a70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,6 +1568,12 @@ struct task_struct {
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
 	} memcg_batch;
+	struct memcg_oom_info {
+		unsigned int in_userfault:1;
+		unsigned int in_memcg_oom:1;
+		int wakeups;
+		struct mem_cgroup *wait_on_memcg;
+	} memcg_oom;
 #endif
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
diff --git a/mm/ksm.c b/mm/ksm.c
index 310544a..3295a3b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma->vm_mm, vma, addr,
-							FAULT_FLAG_WRITE);
+					FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE);
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..67189b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -249,6 +249,7 @@ struct mem_cgroup {
 
 	bool		oom_lock;
 	atomic_t	under_oom;
+	atomic_t	oom_wakeups;
 
 	atomic_t	refcnt;
 
@@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
 
 static void memcg_wakeup_oom(struct mem_cgroup *memcg)
 {
+	atomic_inc(&memcg->oom_wakeups);
 	/* for filtering, pass "memcg" as argument. */
 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
@@ -1857,55 +1859,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 }
 
 /*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
  */
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
-	struct oom_wait_info owait;
-	bool locked, need_to_kill;
-
-	owait.mem = memcg;
-	owait.wait.flags = 0;
-	owait.wait.func = memcg_oom_wake_function;
-	owait.wait.private = current;
-	INIT_LIST_HEAD(&owait.wait.task_list);
-	need_to_kill = true;
-	mem_cgroup_mark_under_oom(memcg);
+	bool locked, need_to_kill = true;
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
-	/*
-	 * Even if signal_pending(), we can't quit charge() loop without
-	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
-	 * under OOM is always welcomed, use TASK_KILLABLE here.
-	 */
-	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
 	if (!locked || memcg->oom_kill_disable)
 		need_to_kill = false;
 	if (locked)
 		mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
-	if (need_to_kill) {
-		finish_wait(&memcg_oom_waitq, &owait.wait);
-		mem_cgroup_out_of_memory(memcg, mask);
-	} else {
-		schedule();
-		finish_wait(&memcg_oom_waitq, &owait.wait);
+	/*
+	 * A system call can just return -ENOMEM, but if this is a
+	 * page fault and somebody else is handling the OOM already,
+	 * we need to sleep on the OOM waitqueue for this memcg until
+	 * the situation is resolved.  Which can take some time
+	 * because it might be handled by a userspace task.
+	 *
+	 * However, this is the charge context, which means that we
+	 * may sit on a large call stack and hold various filesystem
+	 * locks, the mmap_sem etc. and we don't want the OOM handler
+	 * to deadlock on them while we sit here and wait.  Store the
+	 * current OOM context in the task_struct, then return
+	 * -ENOMEM.  At the end of the page fault handler, with the
+	 * stack unwound, pagefault_out_of_memory() will check back
+	 * with us by calling mem_cgroup_oom_synchronize(), possibly
+	 * putting the task to sleep.
+	 */
+	if (current->memcg_oom.in_userfault) {
+		current->memcg_oom.in_memcg_oom = 1;
+		/*
+		 * Somebody else is handling the situation.  Make sure
+		 * no wakeups are missed between now and going to
+		 * sleep at the end of the page fault.
+		 */
+		if (!need_to_kill) {
+			mem_cgroup_mark_under_oom(memcg);
+			current->memcg_oom.wakeups =
+				atomic_read(&memcg->oom_wakeups);
+			css_get(&memcg->css);
+			current->memcg_oom.wait_on_memcg = memcg;
+		}
 	}
-	spin_lock(&memcg_oom_lock);
-	if (locked)
+
+	if (need_to_kill)
+		mem_cgroup_out_of_memory(memcg, mask);
+
+	if (locked) {
+		spin_lock(&memcg_oom_lock);
 		mem_cgroup_oom_unlock(memcg);
-	memcg_wakeup_oom(memcg);
-	spin_unlock(&memcg_oom_lock);
+		/*
+		 * Sleeping tasks might have been killed, make sure
+		 * they get scheduled so they can exit.
+		 */
+		if (need_to_kill)
+			memcg_oom_recover(memcg);
+		spin_unlock(&memcg_oom_lock);
+	}
+}
 
-	mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+	struct oom_wait_info owait;
+	struct mem_cgroup *memcg;
 
-	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+	/* OOM is global, do not handle */
+	if (!current->memcg_oom.in_memcg_oom)
 		return false;
-	/* Give chance to dying process */
-	schedule_timeout_uninterruptible(1);
+
+	/*
+	 * We invoked the OOM killer but there is a chance that a kill
+	 * did not free up any charges.  Everybody else might already
+	 * be sleeping, so restart the fault and keep the rampage
+	 * going until some charges are released.
+	 */
+	memcg = current->memcg_oom.wait_on_memcg;
+	if (!memcg)
+		goto out;
+
+	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+		goto out_put;
+
+	owait.mem = memcg;
+	owait.wait.flags = 0;
+	owait.wait.func = memcg_oom_wake_function;
+	owait.wait.private = current;
+	INIT_LIST_HEAD(&owait.wait.task_list);
+
+	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+	/* Only sleep if we didn't miss any wakeups since OOM */
+	if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+		schedule();
+	finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+	mem_cgroup_unmark_under_oom(memcg);
+	css_put(&memcg->css);
+	current->memcg_oom.wait_on_memcg = NULL;
+out:
+	current->memcg_oom.in_memcg_oom = 0;
 	return true;
 }
 
@@ -2195,11 +2251,10 @@ enum {
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
 	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
 	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
-	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				unsigned int nr_pages, bool oom_check)
+				unsigned int nr_pages, bool invoke_oom)
 {
 	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2312,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		return CHARGE_RETRY;
 
-	/* If we don't need to call oom-killer at el, return immediately */
-	if (!oom_check)
-		return CHARGE_NOMEM;
-	/* check OOM */
-	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
-		return CHARGE_OOM_DIE;
+	if (invoke_oom)
+		mem_cgroup_oom(mem_over_limit, gfp_mask);
 
-	return CHARGE_RETRY;
+	return CHARGE_NOMEM;
 }
 
 /*
@@ -2349,7 +2400,7 @@ again:
 	}
 
 	do {
-		bool oom_check;
+		bool invoke_oom = oom && !nr_oom_retries;
 
 		/* If killed, bypass charge */
 		if (fatal_signal_pending(current)) {
@@ -2357,13 +2408,7 @@ again:
 			goto bypass;
 		}
 
-		oom_check = false;
-		if (oom && !nr_oom_retries) {
-			oom_check = true;
-			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-		}
-
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
 		switch (ret) {
 		case CHARGE_OK:
 			break;
@@ -2376,16 +2421,12 @@ again:
 			css_put(&memcg->css);
 			goto nomem;
 		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom) {
+			if (!oom || invoke_oom) {
 				css_put(&memcg->css);
 				goto nomem;
 			}
-			/* If oom, we never return -ENOMEM */
 			nr_oom_retries--;
 			break;
-		case CHARGE_OOM_DIE: /* Killed by OOM Killer */
-			css_put(&memcg->css);
-			goto bypass;
 		}
 	} while (ret != CHARGE_OK);
 
diff --git a/mm/memory.c b/mm/memory.c
index 829d437..bee177c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1720,7 +1720,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
 				int ret;
-				unsigned int fault_flags = 0;
+				unsigned int fault_flags = FAULT_FLAG_KERNEL;
 
 				/* For mlock, just skip the stack guard page. */
 				if (foll_flags & FOLL_MLOCK) {
@@ -1842,6 +1842,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
+	fault_flags |= FAULT_FLAG_KERNEL;
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
 		if (ret & VM_FAULT_OOM)
@@ -3439,22 +3440,14 @@ unlock:
 /*
  * By the time we get here, we already hold the mm semaphore
  */
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			     unsigned long address, unsigned int flags)
 {
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
-	__set_current_state(TASK_RUNNING);
-
-	count_vm_event(PGFAULT);
-	mem_cgroup_count_vm_event(mm, PGFAULT);
-
-	/* do counter updates before entering really critical section. */
-	check_sync_rss_stat(current);
-
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
@@ -3503,6 +3496,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long address, unsigned int flags)
+{
+	int in_userfault = !(flags & FAULT_FLAG_KERNEL);
+	int ret;
+
+	__set_current_state(TASK_RUNNING);
+
+	count_vm_event(PGFAULT);
+	mem_cgroup_count_vm_event(mm, PGFAULT);
+
+	/* do counter updates before entering really critical section. */
+	check_sync_rss_stat(current);
+
+	if (in_userfault)
+		mem_cgroup_set_userfault(current);
+
+	ret = __handle_mm_fault(mm, vma, address, flags);
+
+	if (in_userfault)
+		mem_cgroup_clear_userfault(current);
+
+	return ret;
+}
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
  */
 void pagefault_out_of_memory(void)
 {
+	if (mem_cgroup_oom_synchronize())
+		return;
 	if (try_set_system_oom()) {
 		out_of_memory(NULL, 0, 0, NULL);
 		clear_system_oom();
-- 
1.7.10.4
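The three behavioral cases from the changelog can be condensed into a toy decision table; this is pure illustration in Python, not kernel code, and all names are made up for the sketch:

```python
ENOMEM = "ENOMEM"          # unwind the stack with -ENOMEM
RESTART = "restart_fault"  # page fault is retried after unwinding

def charge_fails(in_userfault, someone_else_handling):
    """Toy model of the patch's policy for a task whose memcg charge failed.

    Returns (who handles the OOM, what the charging task does next).
    """
    if not in_userfault:
        # Case 1: system call context - invoke the killer, return
        # -ENOMEM, never sleep on the OOM waitqueue with locks held.
        return ("invoke_oom_killer", ENOMEM)
    if not someone_else_handling:
        # Case 2: page fault, this task owns the OOM - invoke the killer,
        # then restart the fault instead of looping on the charge.
        return ("invoke_oom_killer", RESTART)
    # Case 3: page fault, OOM already being handled - remember the memcg,
    # fully unwind, and let pagefault_out_of_memory() sleep or restart
    # once no locks are held.
    return ("record_memcg_and_sleep_later", RESTART)
```

The point of the table is that no branch ever sleeps or loops while still inside the charge context.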

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-07 13:11                                                                                                                                 ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Michal Hocko
@ 2013-06-17 10:21                                                                                                                                   ` azurIt
  2013-06-19 13:26                                                                                                                                     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-06-17 10:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>Here we go. I hope I didn't screw anything (Johannes might double check)
>because there were quite some changes in the area since 3.2. Nothing
>earth shattering though. Please note that I have only compile tested
>this. Also make sure you remove the previous patches you have from me.


Hi Michal,

it, unfortunately, didn't work. Everything was working fine but the original problem is still occurring. I'm unable to send you stacks or more info because the problem has been taking down the whole server for some time now (i don't know what exactly caused it to start happening, maybe newer versions of 3.2.x). But i'm sure of one thing - when the problem occurs, nothing is able to access the hard drives (every process which tries it is frozen until the problem is resolved or the server is rebooted). The problem is fixed after killing the processes from the cgroup which caused it, and everything immediately starts to work normally again. I found this out by keeping a terminal open from another server to the one where my problem occurs quite often and running several apps there (htop, iotop, etc.). When the problem occurred, all apps which weren't working with the HDD were ok. htop proved to be very useful here because it only reads the proc filesystem and is also able to send KILL signals - i was able to resolve the problem with it without rebooting the server.

I created a special daemon (about a month ago) which is able to detect and fix the problem, so i'm not having server outages now. The point was to NOT access anything which is stored on the HDDs; the daemon only reads info from the cgroup filesystem and sends KILL signals to processes. Maybe i should also read the stack files before killing, i will try it.
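[A minimal sketch of what such a daemon's scan pass could look like - not azurIt's actual code; the cgroup v1 mount point and layout are assumptions, and it is written as a one-shot scan meant to be run as root from a loop or cron:]

```shell
#!/bin/sh
# Hypothetical one-shot watchdog scan: find memcgs stuck under OOM and
# SIGKILL every task in them. It touches only cgroupfs, never the hung
# disks. CGROOT and the v1 layout below are assumptions.
CGROOT=${CGROOT:-/sys/fs/cgroup/memory}

scan_memcgs() {
    for oc in "$CGROOT"/*/memory.oom_control; do
        [ -f "$oc" ] || continue
        dir=${oc%/memory.oom_control}
        # under_oom stays at 1 while the group sits in the OOM handler
        grep -q '^under_oom 1$' "$oc" || continue
        # kill everything still listed in the group
        while read -r pid; do
            kill -KILL "$pid" 2>/dev/null || true
        done < "$dir/tasks"
    done
}

scan_memcgs
```

Run from a loop (`while :; do scan_memcgs; sleep 5; done`) this reproduces the described behaviour: detect under_oom groups and free them by killing their tasks, without any disk access.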

Btw, which vanilla kernel includes this patch?

Thank you and everyone involved very much for time and help.

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-17 10:21                                                                                                                                   ` azurIt
@ 2013-06-19 13:26                                                                                                                                     ` Michal Hocko
  2013-06-22 20:09                                                                                                                                       ` azurIt
  2013-06-24 16:48                                                                                                                                       ` azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Michal Hocko @ 2013-06-19 13:26 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

On Mon 17-06-13 12:21:34, azurIt wrote:
> >Here we go. I hope I didn't screw anything (Johannes might double check)
> >because there were quite some changes in the area since 3.2. Nothing
> >earth shattering though. Please note that I have only compile tested
> >this. Also make sure you remove the previous patches you have from me.
> 
> 
> Hi Michal,
> 
> it, unfortunately, didn't work. Everything was working fine but
> original problem is still occurring. 

This would be more than surprising because tasks blocked at memcg OOM
don't hold any locks anymore. Maybe I have messed something up during
the backport but I cannot spot anything.

> I'm unable to send you stacks or more info because problem is taking
> down the whole server for some time now (don't know what exactly
> caused it to start happening, maybe newer versions of 3.2.x).

So you are not testing with the same kernel with just the old patch
replaced by the new one?

> But i'm sure of one thing - when problem occurs, nothing is able to
> access hard drives (every process which tries it is frozen until
> problem is resolved or server is rebooted).

It would be really interesting to see what those tasks are blocked on.

> Problem is fixed after killing processes from cgroup which
> caused it and everything immediatelly starts to work normally. I
> find this out by keeping terminal opened from another server to one
> where my problem is occurring quite often and running several apps
> there (htop, iotop, etc.). When problem occurs, all apps which weren't
> working with the HDD were ok. htop proved to be very useful here
> because it's only reading proc filesystem and is also able to send
> KILL signals - i was able to resolve the problem with it
>   without rebooting the server.

sysrq+t will give you the list of all tasks and their traces.
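[For completeness, an example of triggering it from a root shell - the output lands in the kernel ring buffer, so capturing it avoids the hung disks; on a remote box netconsole or a serial console is the safer sink:]

```shell
# Enable sysrq and dump every task's state and kernel stack (needs root).
echo 1 > /proc/sys/kernel/sysrq    # allow all sysrq functions
echo t > /proc/sysrq-trigger       # 't' = show task states and traces
dmesg | tail -n 200                # traces appear in the kernel ring buffer
```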

> I created a special daemon (about a month ago) which is able to detect
> and fix the problem so i'm not having server outages now. The point
> was to NOT access anything which is stored on HDDs, the daemon is
> only reading info from cgroup filesystem and sending KILL signals to
> processes. Maybe i should be able to also read stack files before
> killing, i will try it.
> 
> Btw, which vanilla kernel includes this patch?

None yet. But I hope it will be merged to 3.11 and backported to the
stable trees.
 
> Thank you and everyone involved very much for time and help.
> 
> azur

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-19 13:26                                                                                                                                     ` Michal Hocko
@ 2013-06-22 20:09                                                                                                                                       ` azurIt
  2013-06-24 20:13                                                                                                                                         ` Johannes Weiner
  2013-06-24 16:48                                                                                                                                       ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2013-06-22 20:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

Michal,



>> I'm unable to send you stacks or more info because problem is taking
>> down the whole server for some time now (don't know what exactly
>> caused it to start happening, maybe newer versions of 3.2.x).
>
>So you are not testing with the same kernel with just the old patch
>replaced by the new one?


No, i'm not testing with the same kernel, but all are 3.2.x. I can't even install an older 3.2.x because grsecurity is only available for the newest kernel and there is no archive of older versions (at least i don't know about any).


>> But i'm sure of one thing - when problem occurs, nothing is able to
>> access hard drives (every process which tries it is frozen until
>> problem is resolved or server is rebooted).
>
>It would be really interesting to see what those tasks are blocked on.


I'm trying to get it, stay tuned :)


Today i noticed one bug; i'm not 100% sure it is related to 'your' patch but i hadn't seen this before. I noticed that i have lots of cgroups which cannot be removed - if i do 'rmdir <cgroup_directory>', it just hangs and never completes. Even more, it's not possible to access the whole cgroup filesystem until i kill that rmdir (anything which tries it just hangs). All unremovable cgroups have this in 'memory.oom_control':
oom_kill_disable 0
under_oom 1

And, yes, 'tasks' file is empty.
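[A quick way to enumerate such leaked groups - a sketch assuming the cgroup v1 memory controller mounted at /sys/fs/cgroup/memory; a group with under_oom set but an empty tasks file is exactly the stuck case described above:]

```shell
#!/bin/sh
# Hypothetical helper: list memcgs that look leaked -- under_oom is set
# but no tasks are left to resolve the OOM. Mount point is an assumption.
CGROOT=${CGROOT:-/sys/fs/cgroup/memory}

list_stuck() {
    for d in "$CGROOT"/*/; do
        grep -q '^under_oom 1$' "$d/memory.oom_control" 2>/dev/null || continue
        # tasks still present means the OOM is live, not leaked
        [ -s "$d/tasks" ] && continue
        echo "stuck: $d"
    done
}

list_stuck
```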

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-19 13:26                                                                                                                                     ` Michal Hocko
  2013-06-22 20:09                                                                                                                                       ` azurIt
@ 2013-06-24 16:48                                                                                                                                       ` azurIt
  1 sibling, 0 replies; 172+ messages in thread
From: azurIt @ 2013-06-24 16:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

>It would be really interesting to see what those tasks are blocked on.


Ok, i got it! The problem occurred two times and behaved differently each time; i was running the kernel with that latest patch.

1.) It didn't have an impact on the whole server, only on one cgroup. Here are the stacks:
http://watchdog.sk/lkml/memcg-bug-7.tar.gz


2.) It almost took down the server because of huge I/O on the HDDs. Unfortunately, i had a bug in my script which was supposed to gather the stacks (i wasn't able to do it by hand like in (1), the server was almost inoperable). But i was lucky and somehow killed the processes from the problematic cgroup (via htop) and the server was ok again EXCEPT for one important thing - processes from that cgroup were still running in D state and i wasn't able to kill them for good. They were holding the web server's network ports so i had to reboot the server :( BUT, before that, i gathered the stacks:
http://watchdog.sk/lkml/memcg-bug-8.tar.gz
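[For reference, a sketch of the kind of collector used here - a hypothetical helper, not the actual script. The procroot argument is made up so the logic can be exercised against a fake /proc; writing to tmpfs (/dev/shm) keeps the collector off the hung disks:]

```shell
#!/bin/sh
# Hypothetical stack collector: snapshot /proc/<pid>/stack for every task
# in a cgroup before killing anything, so the traces survive the cleanup.
# On a live box: collect_stacks /sys/fs/cgroup/memory/badgrp/tasks /dev/shm/stacks

collect_stacks() {
    tasks=$1
    out=$2
    procroot=${3:-/proc}   # overridable only so tests can point at a fake /proc
    mkdir -p "$out"
    while read -r pid; do
        # /proc/<pid>/stack needs root; failures are ignored on purpose
        cat "$procroot/$pid/stack" > "$out/$pid.stack" 2>/dev/null || true
    done < "$tasks"
}
```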

What do you think?

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-22 20:09                                                                                                                                       ` azurIt
@ 2013-06-24 20:13                                                                                                                                         ` Johannes Weiner
  2013-06-28 10:06                                                                                                                                           ` azurIt
  2013-07-09 13:00                                                                                                                                           ` Michal Hocko
  0 siblings, 2 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-06-24 20:13 UTC (permalink / raw)
  To: azurIt
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

Hi guys,

On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> >> But i'm sure of one thing - when problem occurs, nothing is able to
> >> access hard drives (every process which tries it is frozen until
> >> problem is resolved or server is rebooted).
> >
> >It would be really interesting to see what those tasks are blocked on.
> 
> I'm trying to get it, stay tuned :)
> 
> Today i noticed one bug, not 100% sure it is related to 'your' patch
> but i hadn't seen this before. I noticed that i have lots of cgroups
> which cannot be removed - if i do 'rmdir <cgroup_directory>', it
> just hangs and never complete. Even more, it's not possible to
> access the whole cgroup filesystem until i kill that rmdir
> (anything, which tries it, just hangs). All unremovable cgroups have
> this in 'memory.oom_control': oom_kill_disable 0 under_oom 1

Somebody acquires the OOM wait reference to the memcg and marks it
under oom but then does not call into mem_cgroup_oom_synchronize() to
clean up.  That's why under_oom is set and the rmdir waits for
outstanding references.

> And, yes, 'tasks' file is empty.

It's not a kernel thread that does it because all kernel-context
handle_mm_fault() are annotated properly, which means the task must be
userspace and, since tasks is empty, have exited before synchronizing.

Can you try with the following patch on top?

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..9a0b152 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address);
-		return 1;
-	}
 	if (!(fault & VM_FAULT_ERROR))
 		return 0;
 

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-24 20:13                                                                                                                                         ` Johannes Weiner
@ 2013-06-28 10:06                                                                                                                                           ` azurIt
  2013-07-05 18:17                                                                                                                                             ` Johannes Weiner
  2013-07-09 13:00                                                                                                                                           ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2013-06-28 10:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

>It's not a kernel thread that does it because all kernel-context
>handle_mm_fault() are annotated properly, which means the task must be
>userspace and, since tasks is empty, have exited before synchronizing.
>
>Can you try with the following patch on top?


Michal and Johannes,

i have some observations which i made:
Original patch from Johannes was really fixing something but definitely not everything and was introducing new problems. I'm running unpatched kernel from time i send my last message and problems with freezing cgroups are occuring very often (several times per day) - they were, on the other hand, quite rare with patch from Johannes.

Johannes, i didn't try your last patch yet. I would like to wait until you or Michal look at my last message, which contained detailed information about the freezing of cgroups on a kernel running your original patch (which was supposed to fix it for good). Even more, i would like to hear your opinion about those stuck processes which were holding the web server port and which forced me to reboot a production server in the middle of the day :( more information is in my last message. Thank you very much for your time.

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-28 10:06                                                                                                                                           ` azurIt
@ 2013-07-05 18:17                                                                                                                                             ` Johannes Weiner
  2013-07-05 19:02                                                                                                                                               ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2013-07-05 18:17 UTC (permalink / raw)
  To: azurIt
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

Hi azurIt,

On Fri, Jun 28, 2013 at 12:06:13PM +0200, azurIt wrote:
> >It's not a kernel thread that does it because all kernel-context
> >handle_mm_fault() are annotated properly, which means the task must be
> >userspace and, since tasks is empty, have exited before synchronizing.
> >
> >Can you try with the following patch on top?
> 
> 
> Michal and Johannes,
> 
> i have some observations which i made: Original patch from Johannes
> was really fixing something but definitely not everything and was
> introducing new problems. I'm running unpatched kernel from time i
> sent my last message and problems with freezing cgroups are occurring
> very often (several times per day) - they were, on the other hand,
> quite rare with patch from Johannes.

That's good!

> Johannes, i didn't try your last patch yet. I would like to wait
> until you or Michal look at my last message which contained detailed
> information about freezing of cgroups on kernel running your
> original patch (which was suppose to fix it for good). Even more, i
> would like to hear your opinion about that stucked processes which
> was holding web server port and which forced me to reboot production
> server at the middle of the day :( more information was in my last
> message. Thank you very much for your time.

I looked at your debug messages but could not find anything that would
hint at a deadlock.  All tasks are stuck in the refrigerator, so I
assume you use the freezer cgroup and enabled it somehow?

Sorry about your production server locking up, but from the stacks I
don't see any connection to the OOM problems you were having... :/

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-05 18:17                                                                                                                                             ` Johannes Weiner
@ 2013-07-05 19:02                                                                                                                                               ` azurIt
  2013-07-05 19:18                                                                                                                                                 ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-07-05 19:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

>I looked at your debug messages but could not find anything that would
>hint at a deadlock.  All tasks are stuck in the refrigerator, so I
>assume you use the freezer cgroup and enabled it somehow?


Yes, i'm really using the freezer cgroup BUT i was checking whether it wasn't causing the problems - unfortunately, several days have passed since that day and now i don't fully remember if i checked it for both cases (unremovable cgroups and those frozen processes holding the web server port). I'm 100% sure i checked it for the unremovable cgroups, but not so sure for the other problem (i had to act quickly in that case). Are you sure (from the stacks) that the freezer cgroup was enabled there?
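[A way to rule the freezer out next time - a sketch assuming the cgroup v1 freezer hierarchy mounted at /sys/fs/cgroup/freezer; FREEZER_ROOT is a made-up override so the check can be exercised against a fake tree:]

```shell
#!/bin/sh
# Hypothetical spot-check: print any freezer group that is not THAWED.
# The mount point below is an assumption; adjust to your layout.
FREEZER_ROOT=${FREEZER_ROOT:-/sys/fs/cgroup/freezer}

frozen_groups() {
    for f in "$FREEZER_ROOT"/*/freezer.state; do
        [ -f "$f" ] || continue
        state=$(cat "$f")
        # FREEZING/FROZEN groups would explain tasks stuck in the refrigerator
        [ "$state" = "THAWED" ] || echo "${f%/freezer.state}: $state"
    done
}

frozen_groups
```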

Btw, what about those other stacks? I mean this file:
http://watchdog.sk/lkml/memcg-bug-7.tar.gz

It was taken while running the kernel with your patch, and from a cgroup which was under unresolvable OOM (just like my very original problem).

Thank you!


azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-05 19:02                                                                                                                                               ` azurIt
@ 2013-07-05 19:18                                                                                                                                                 ` Johannes Weiner
  2013-07-07 23:42                                                                                                                                                   ` azurIt
  2013-07-14 17:07                                                                                                                                                   ` azurIt
  0 siblings, 2 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-07-05 19:18 UTC (permalink / raw)
  To: azurIt
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
> >I looked at your debug messages but could not find anything that would
> >hint at a deadlock.  All tasks are stuck in the refrigerator, so I
> >assume you use the freezer cgroup and enabled it somehow?
> 
> 
> Yes, i'm really using freezer cgroup BUT i was checking if it's not
> doing problems - unfortunately, several days passed from that day
> and now i don't fully remember if i was checking it for both cases
> (unremovable cgroups and those frozen processes holding the web
> server port). I'm 100% sure i was checking it for unremovable
> cgroups but not so sure for the other problem (i had to act quickly
> in that case). Are you sure (from stacks) that freezer cgroup was
> enabled there?

Yeah, all the traces without exception look like this:

1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff

so the freezer was already enabled when you took the backtraces.

> Btw, what about that other stacks? I mean this file:
> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
> 
> It was taken while running the kernel with your patch and from
> cgroup which was under unresolvable OOM (just like my very original
> problem).

I looked at these traces too, but none of the tasks are stuck in rmdir
or the OOM path.  Some /are/ in the page fault path, but they are
happily doing reclaim and don't appear to be stuck.  So I'm having a
hard time matching this data to what you otherwise observed.

However, based on what you reported the most likely explanation for
the continued hangs is the unfinished OOM handling for which I sent
the followup patch for arch/x86/mm/fault.c.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-05 19:18                                                                                                                                                 ` Johannes Weiner
@ 2013-07-07 23:42                                                                                                                                                   ` azurIt
  2013-07-09 13:10                                                                                                                                                     ` Michal Hocko
  2013-07-14 17:07                                                                                                                                                   ` azurIt
  1 sibling, 1 reply; 172+ messages in thread
From: azurIt @ 2013-07-07 23:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

> CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
>On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
>> >I looked at your debug messages but could not find anything that would
>> >hint at a deadlock.  All tasks are stuck in the refrigerator, so I
>> >assume you use the freezer cgroup and enabled it somehow?
>> 
>> 
>> Yes, i'm really using freezer cgroup BUT i was checking if it's not
>> doing problems - unfortunately, several days passed from that day
>> and now i don't fully remember if i was checking it for both cases
>> (unremovable cgroups and those frozen processes holding the web
>> server port). I'm 100% sure i was checking it for unremovable
>> cgroups but not so sure for the other problem (i had to act quickly
>> in that case). Are you sure (from stacks) that freezer cgroup was
>> enabled there?
>
>Yeah, all the traces without exception look like this:
>
>1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
>1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
>1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
>1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
>1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
>
>so the freezer was already enabled when you took the backtraces.
>
>> Btw, what about that other stacks? I mean this file:
>> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>> 
>> It was taken while running the kernel with your patch and from
>> cgroup which was under unresolvable OOM (just like my very original
>> problem).
>
>I looked at these traces too, but none of the tasks are stuck in rmdir
>or the OOM path.  Some /are/ in the page fault path, but they are
>happily doing reclaim and don't appear to be stuck.  So I'm having a
>hard time matching this data to what you otherwise observed.
>
>However, based on what you reported the most likely explanation for
>the continued hangs is the unfinished OOM handling for which I sent
>the followup patch for arch/x86/mm/fault.c.
>



Johannes,

today I tested both of your patches but the problem with unremovable cgroups, unfortunately, persists.

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-06-24 20:13                                                                                                                                         ` Johannes Weiner
  2013-06-28 10:06                                                                                                                                           ` azurIt
@ 2013-07-09 13:00                                                                                                                                           ` Michal Hocko
  2013-07-09 13:08                                                                                                                                             ` Michal Hocko
  1 sibling, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-09 13:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon 24-06-13 16:13:45, Johannes Weiner wrote:
> Hi guys,
> 
> On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> > >> But i'm sure of one thing - when problem occurs, nothing is able to
> > >> access hard drives (every process which tries it is frozen until
> > >> problem is resolved or server is rebooted).
> > >
> > >It would be really interesting to see what those tasks are blocked on.
> > 
> > I'm trying to get it, stay tuned :)
> > 
> > Today i noticed one bug, not 100% sure it is related to 'your' patch
> > but i hadn't seen this before. I noticed that i have lots of cgroups
> > which cannot be removed - if i do 'rmdir <cgroup_directory>', it
> > just hangs and never complete. Even more, it's not possible to
> > access the whole cgroup filesystem until i kill that rmdir
> > (anything, which tries it, just hangs). All unremovable cgroups have
> > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1
> 
> Somebody acquires the OOM wait reference to the memcg and marks it
> under oom but then does not call into mem_cgroup_oom_synchronize() to
> clean up.  That's why under_oom is set and the rmdir waits for
> outstanding references.
> 
> > And, yes, 'tasks' file is empty.
> 
> It's not a kernel thread that does it because all kernel-context
> handle_mm_fault() are annotated properly, which means the task must be
> userspace and, since tasks is empty, have exited before synchronizing.

Yes, well spotted. I have missed that while reviewing your patch.
The follow up fix looks correct.

> Can you try with the following patch on top?
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 5db0490..9a0b152 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -846,17 +846,6 @@ static noinline int
>  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>  	       unsigned long address, unsigned int fault)
>  {
> -	/*
> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> -	 * continue pagefault.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		if (!(fault & VM_FAULT_RETRY))
> -			up_read(&current->mm->mmap_sem);
> -		if (!(error_code & PF_USER))
> -			no_context(regs, error_code, address);
> -		return 1;
> -	}
>  	if (!(fault & VM_FAULT_ERROR))
>  		return 0;
>  

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-09 13:00                                                                                                                                           ` Michal Hocko
@ 2013-07-09 13:08                                                                                                                                             ` Michal Hocko
  2013-07-09 13:10                                                                                                                                               ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-09 13:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Tue 09-07-13 15:00:17, Michal Hocko wrote:
> On Mon 24-06-13 16:13:45, Johannes Weiner wrote:
> > Hi guys,
> > 
> > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> > > >> But i'm sure of one thing - when problem occurs, nothing is able to
> > > >> access hard drives (every process which tries it is frozen until
> > > >> problem is resolved or server is rebooted).
> > > >
> > > >It would be really interesting to see what those tasks are blocked on.
> > > 
> > > I'm trying to get it, stay tuned :)
> > > 
> > > Today i noticed one bug, not 100% sure it is related to 'your' patch
> > > but i hadn't seen this before. I noticed that i have lots of cgroups
> > > which cannot be removed - if i do 'rmdir <cgroup_directory>', it
> > > just hangs and never complete. Even more, it's not possible to
> > > access the whole cgroup filesystem until i kill that rmdir
> > > (anything, which tries it, just hangs). All unremovable cgroups have
> > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1
> > 
> > Somebody acquires the OOM wait reference to the memcg and marks it
> > under oom but then does not call into mem_cgroup_oom_synchronize() to
> > clean up.  That's why under_oom is set and the rmdir waits for
> > outstanding references.
> > 
> > > And, yes, 'tasks' file is empty.
> > 
> > It's not a kernel thread that does it because all kernel-context
> > handle_mm_fault() are annotated properly, which means the task must be
> > userspace and, since tasks is empty, have exited before synchronizing.
> 
> Yes, well spotted. I have missed that while reviewing your patch.
> The follow up fix looks correct.

Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well
otherwise the else BUG() path would be unreachable and we wouldn't know
that something fishy is going on.

> > Can you try with the following patch on top?
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 5db0490..9a0b152 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -846,17 +846,6 @@ static noinline int
> >  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> >  	       unsigned long address, unsigned int fault)
> >  {
> > -	/*
> > -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> > -	 * continue pagefault.
> > -	 */
> > -	if (fatal_signal_pending(current)) {
> > -		if (!(fault & VM_FAULT_RETRY))
> > -			up_read(&current->mm->mmap_sem);
> > -		if (!(error_code & PF_USER))
> > -			no_context(regs, error_code, address);
> > -		return 1;
> > -	}
> >  	if (!(fault & VM_FAULT_ERROR))
> >  		return 0;
> >  
> 
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-09 13:08                                                                                                                                             ` Michal Hocko
@ 2013-07-09 13:10                                                                                                                                               ` Michal Hocko
  0 siblings, 0 replies; 172+ messages in thread
From: Michal Hocko @ 2013-07-09 13:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Tue 09-07-13 15:08:08, Michal Hocko wrote:
> On Tue 09-07-13 15:00:17, Michal Hocko wrote:
> > On Mon 24-06-13 16:13:45, Johannes Weiner wrote:
> > > Hi guys,
> > > 
> > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> > > > >> But I'm sure of one thing - when the problem occurs, nothing is able to
> > > > >> access the hard drives (every process which tries is frozen until the
> > > > >> problem is resolved or the server is rebooted).
> > > > >
> > > > >I would be really interesting to see what those tasks are blocked on.
> > > > 
> > > > I'm trying to get it, stay tuned :)
> > > > 
> > > > Today I noticed one bug; I'm not 100% sure it is related to 'your' patch,
> > > > but I hadn't seen this before. I noticed that I have lots of cgroups
> > > > which cannot be removed - if I do 'rmdir <cgroup_directory>', it
> > > > just hangs and never completes. Even worse, it's not possible to
> > > > access the whole cgroup filesystem until I kill that rmdir
> > > > (anything which tries it just hangs). All unremovable cgroups have
> > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1
> > > 
> > > Somebody acquires the OOM wait reference to the memcg and marks it
> > > under oom but then does not call into mem_cgroup_oom_synchronize() to
> > > clean up.  That's why under_oom is set and the rmdir waits for
> > > outstanding references.
> > > 
> > > > And, yes, 'tasks' file is empty.
> > > 
> > > It's not a kernel thread that does it because all kernel-context
> > > handle_mm_fault() are annotated properly, which means the task must be
> > > userspace and, since tasks is empty, have exited before synchronizing.
> > 
> > Yes, well spotted. I have missed that while reviewing your patch.
> > The follow up fix looks correct.
> 
> Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well
> otherwise the else BUG() path would be unreachable and we wouldn't know
> that something fishy is going on.

No, scratch it! We need it for VM_FAULT_RETRY. Sorry about the noise.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-07 23:42                                                                                                                                                   ` azurIt
@ 2013-07-09 13:10                                                                                                                                                     ` Michal Hocko
  2013-07-09 13:19                                                                                                                                                       ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-09 13:10 UTC (permalink / raw)
  To: azurIt
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

On Mon 08-07-13 01:42:24, azurIt wrote:
> > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
> >> >I looked at your debug messages but could not find anything that would
> >> >hint at a deadlock.  All tasks are stuck in the refrigerator, so I
> >> >assume you use the freezer cgroup and enabled it somehow?
> >> 
> >> 
> >> Yes, I'm really using the freezer cgroup, BUT I was checking whether
> >> it was causing problems - unfortunately, several days have passed
> >> since then and now I don't fully remember if I checked it for both
> >> cases (unremovable cgroups and the frozen processes holding the web
> >> server port). I'm 100% sure I checked it for unremovable
> >> cgroups but not so sure for the other problem (I had to act quickly
> >> in that case). Are you sure (from the stacks) that the freezer cgroup
> >> was enabled there?
> >
> >Yeah, all the traces without exception look like this:
> >
> >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
> >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
> >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
> >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
> >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
> >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
> >
> >so the freezer was already enabled when you took the backtraces.
> >
> >> Btw, what about that other stacks? I mean this file:
> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
> >> 
> >> It was taken while running the kernel with your patch and from
> >> cgroup which was under unresolveable OOM (just like my very original
> >> problem).
> >
> >I looked at these traces too, but none of the tasks are stuck in rmdir
> >or the OOM path.  Some /are/ in the page fault path, but they are
> >happily doing reclaim and don't appear to be stuck.  So I'm having a
> >hard time matching this data to what you otherwise observed.

Agreed.

> >However, based on what you reported the most likely explanation for
> >the continued hangs is the unfinished OOM handling for which I sent
> >the followup patch for arch/x86/mm/fault.c.
> 
> Johannes,
> 
> today I tested both of your patches but problem with unremovable
> cgroups, unfortunately, persists.

Is the group empty again, with under_oom still set?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-09 13:10                                                                                                                                                     ` Michal Hocko
@ 2013-07-09 13:19                                                                                                                                                       ` azurIt
  2013-07-09 13:54                                                                                                                                                         ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-07-09 13:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

>On Mon 08-07-13 01:42:24, azurIt wrote:
>> > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
>> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
>> >> >I looked at your debug messages but could not find anything that would
>> >> >hint at a deadlock.  All tasks are stuck in the refrigerator, so I
>> >> >assume you use the freezer cgroup and enabled it somehow?
>> >> 
>> >> 
>> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not
>> >> doing problems - unfortunately, several days passed from that day
>> >> and now i don't fully remember if i was checking it for both cases
>> >> (unremoveabled cgroups and these freezed processes holding web
>> >> server port). I'm 100% sure i was checking it for unremoveable
>> >> cgroups but not so sure for the other problem (i had to act quickly
>> >> in that case). Are you sure (from stacks) that freezer cgroup was
>> >> enabled there?
>> >
>> >Yeah, all the traces without exception look like this:
>> >
>> >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
>> >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
>> >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
>> >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>> >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
>> >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> >so the freezer was already enabled when you took the backtraces.
>> >
>> >> Btw, what about that other stacks? I mean this file:
>> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>> >> 
>> >> It was taken while running the kernel with your patch and from
>> >> cgroup which was under unresolveable OOM (just like my very original
>> >> problem).
>> >
>> >I looked at these traces too, but none of the tasks are stuck in rmdir
>> >or the OOM path.  Some /are/ in the page fault path, but they are
>> >happily doing reclaim and don't appear to be stuck.  So I'm having a
>> >hard time matching this data to what you otherwise observed.
>
>Agreed.
>
>> >However, based on what you reported the most likely explanation for
>> >the continued hangs is the unfinished OOM handling for which I sent
>> >the followup patch for arch/x86/mm/fault.c.
>> 
>> Johannes,
>> 
>> today I tested both of your patches but problem with unremovable
>> cgroups, unfortunately, persists.
>
>Is the group empty again with marked under_oom?


Now I realized that I forgot to remove the UID from that cgroup before trying
to remove it, so the cgroup cannot be removed anyway (we are using a
third-party cgroup controller called cgroup-uid from Andrea Righi, which is
able to associate all of a user's processes with a target cgroup). Look here
for the cgroup-uid patch:
https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch

ANYWAY, I'm 101% sure that the 'tasks' file was empty and 'under_oom' was permanently '1'.

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-09 13:19                                                                                                                                                       ` azurIt
@ 2013-07-09 13:54                                                                                                                                                         ` Michal Hocko
  2013-07-10 16:25                                                                                                                                                           ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-09 13:54 UTC (permalink / raw)
  To: azurIt
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

On Tue 09-07-13 15:19:21, azurIt wrote:
[...]
> Now i realized that i forgot to remove UID from that cgroup before
> trying to remove it, so cgroup cannot be removed anyway (we are using
> third party cgroup called cgroup-uid from Andrea Righi, which is able
> to associate all user's processes with target cgroup). Look here for
> cgroup-uid patch:
> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> 
> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> permanently '1'.

This is really strange. Could you post the whole diff against the stable
tree you are using (except for the grsecurity stuff and the above cgroup-uid
patch)?

Btw. the below patch might help us point to the exit path which
leaves wait_on_memcg set without calling mem_cgroup_oom_synchronize:
---
diff --git a/kernel/exit.c b/kernel/exit.c
index e6e01b9..ad472e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
 
 	profile_task_exit(tsk);
 
+	WARN_ON(current->memcg_oom.wait_on_memcg);
 	WARN_ON(blk_needs_flush_plug(tsk));
 
 	if (unlikely(in_interrupt()))
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-09 13:54                                                                                                                                                         ` Michal Hocko
@ 2013-07-10 16:25                                                                                                                                                           ` azurIt
  2013-07-11  7:25                                                                                                                                                             ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-07-10 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

>> Now i realized that i forgot to remove UID from that cgroup before
>> trying to remove it, so cgroup cannot be removed anyway (we are using
>> third party cgroup called cgroup-uid from Andrea Righi, which is able
>> to associate all user's processes with target cgroup). Look here for
>> cgroup-uid patch:
>> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>> 
>> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>> permanently '1'.
>
>This is really strange. Could you post the whole diff against stable
>tree you are using (except for grsecurity stuff and the above cgroup-uid
>patch)?


Here are all the patches that I applied to kernel 3.2.48 in my last test:
http://watchdog.sk/lkml/patches3/

Patches marked 7-* are from Johannes. I'm applying them in order, except for grsecurity, which goes first.

azur




>Btw. the bellow patch might help us to point to the exit path which
>leaves wait_on_memcg without mem_cgroup_oom_synchronize:
>---
>diff --git a/kernel/exit.c b/kernel/exit.c
>index e6e01b9..ad472e0 100644
>--- a/kernel/exit.c
>+++ b/kernel/exit.c
>@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
> 
> 	profile_task_exit(tsk);
> 
>+	WARN_ON(current->memcg_oom.wait_on_memcg);
> 	WARN_ON(blk_needs_flush_plug(tsk));
> 
> 	if (unlikely(in_interrupt()))
>-- 
>Michal Hocko
>SUSE Labs
>

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-10 16:25                                                                                                                                                           ` azurIt
@ 2013-07-11  7:25                                                                                                                                                             ` Michal Hocko
  2013-07-13 23:26                                                                                                                                                               ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-11  7:25 UTC (permalink / raw)
  To: azurIt
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Wed 10-07-13 18:25:06, azurIt wrote:
> >> Now i realized that i forgot to remove UID from that cgroup before
> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> >> to associate all user's processes with target cgroup). Look here for
> >> cgroup-uid patch:
> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> >> 
> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> >> permanently '1'.
> >
> >This is really strange. Could you post the whole diff against stable
> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> >patch)?
> 
> 
> Here are all patches which i applied to kernel 3.2.48 in my last test:
> http://watchdog.sk/lkml/patches3/

The two patches from Johannes seem correct.

From a quick look, even the grsecurity patchset shouldn't interfere, as it
doesn't seem to put any code between handle_mm_fault and mm_fault_error,
and there also don't seem to be any new handle_mm_fault call sites.

But I cannot tell that there aren't other code paths which would lead to a
memcg charge, and thus OOM, without proper FAULT_FLAG_KERNEL handling.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-11  7:25                                                                                                                                                             ` Michal Hocko
@ 2013-07-13 23:26                                                                                                                                                               ` azurIt
  2013-07-13 23:51                                                                                                                                                                 ` azurIt
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-07-13 23:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>On Wed 10-07-13 18:25:06, azurIt wrote:
>> >> Now i realized that i forgot to remove UID from that cgroup before
>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>> >> to associate all user's processes with target cgroup). Look here for
>> >> cgroup-uid patch:
>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>> >> 
>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>> >> permanently '1'.
>> >
>> >This is really strange. Could you post the whole diff against stable
>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>> >patch)?
>> 
>> 
>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>> http://watchdog.sk/lkml/patches3/
>
>The two patches from Johannes seem correct.
>
>From a quick look even grsecurity patchset shouldn't interfere as it
>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>and there also doesn't seem to be any new handle_mm_fault call sites.
>
>But I cannot tell there aren't other code paths which would lead to a
>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.


Michal,

now I can definitely confirm that the problem with unremovable cgroups persists. What info do you need from me? I also applied your little 'WARN_ON' patch.

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-13 23:26                                                                                                                                                               ` azurIt
@ 2013-07-13 23:51                                                                                                                                                                 ` azurIt
  2013-07-15 15:41                                                                                                                                                                   ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: azurIt @ 2013-07-13 23:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>>On Wed 10-07-13 18:25:06, azurIt wrote:
>>> >> Now i realized that i forgot to remove UID from that cgroup before
>>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>>> >> to associate all user's processes with target cgroup). Look here for
>>> >> cgroup-uid patch:
>>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>>> >> 
>>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>>> >> permanently '1'.
>>> >
>>> >This is really strange. Could you post the whole diff against stable
>>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>>> >patch)?
>>> 
>>> 
>>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>>> http://watchdog.sk/lkml/patches3/
>>
>>The two patches from Johannes seem correct.
>>
>>From a quick look even grsecurity patchset shouldn't interfere as it
>>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>>and there also doesn't seem to be any new handle_mm_fault call sites.
>>
>>But I cannot tell there aren't other code paths which would lead to a
>>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
>
>
>Michal,
>
>now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch.
>
>azur



OK, I think you want this:
http://watchdog.sk/lkml/kern4.log

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-05 19:18                                                                                                                                                 ` Johannes Weiner
  2013-07-07 23:42                                                                                                                                                   ` azurIt
@ 2013-07-14 17:07                                                                                                                                                   ` azurIt
  1 sibling, 0 replies; 172+ messages in thread
From: azurIt @ 2013-07-14 17:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki

> CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
>On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
>> >I looked at your debug messages but could not find anything that would
>> >hint at a deadlock.  All tasks are stuck in the refrigerator, so I
>> >assume you use the freezer cgroup and enabled it somehow?
>> 
>> 
>> Yes, i'm really using freezer cgroup BUT i was checking if it's not
>> doing problems - unfortunately, several days passed from that day
>> and now i don't fully remember if i was checking it for both cases
>> (unremoveabled cgroups and these freezed processes holding web
>> server port). I'm 100% sure i was checking it for unremoveable
>> cgroups but not so sure for the other problem (i had to act quickly
>> in that case). Are you sure (from stacks) that freezer cgroup was
>> enabled there?
>
>Yeah, all the traces without exception look like this:
>
>1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
>1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
>1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
>1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
>1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
>
>so the freezer was already enabled when you took the backtraces.
>
>> Btw, what about that other stacks? I mean this file:
>> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>> 
>> It was taken while running the kernel with your patch and from
>> cgroup which was under unresolveable OOM (just like my very original
>> problem).
>
>I looked at these traces too, but none of the tasks are stuck in rmdir
>or the OOM path.  Some /are/ in the page fault path, but they are
>happily doing reclaim and don't appear to be stuck.  So I'm having a
>hard time matching this data to what you otherwise observed.
>
>However, based on what you reported the most likely explanation for
>the continued hangs is the unfinished OOM handling for which I sent
>the followup patch for arch/x86/mm/fault.c.


Johannes,

this problem happened again and was even worse; now I'm sure it wasn't my
fault. This time I wasn't even able to access /proc/<pid> of the hung apache
process (which was, again, holding the web server port and forced me to
reboot the server). Everything that tried to access /proc/<pid> just hung.
The server wasn't even able to reboot correctly; it hung and then did a hard
reboot after a few minutes.

azur

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-13 23:51                                                                                                                                                                 ` azurIt
@ 2013-07-15 15:41                                                                                                                                                                   ` Michal Hocko
  2013-07-15 16:00                                                                                                                                                                     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-15 15:41 UTC (permalink / raw)
  To: azurIt
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Sun 14-07-13 01:51:12, azurIt wrote:
> > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> >>On Wed 10-07-13 18:25:06, azurIt wrote:
> >>> >> Now i realized that i forgot to remove UID from that cgroup before
> >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> >>> >> to associate all user's processes with target cgroup). Look here for
> >>> >> cgroup-uid patch:
> >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> >>> >> 
> >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> >>> >> permanently '1'.
> >>> >
> >>> >This is really strange. Could you post the whole diff against stable
> >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> >>> >patch)?
> >>> 
> >>> 
> >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> >>> http://watchdog.sk/lkml/patches3/
> >>
> >>The two patches from Johannes seem correct.
> >>
> >>From a quick look even grsecurity patchset shouldn't interfere as it
> >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> >>and there also doesn't seem to be any new handle_mm_fault call sites.
> >>
> >>But I cannot tell there aren't other code paths which would lead to a
> >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> >
> >
> >Michal,
> >
> >now i can definitely confirm that problem with unremovable cgroups
> >persists. What info do you need from me? I applied also your little
> >'WARN_ON' patch.
> 
> Ok, i think you want this:
> http://watchdog.sk/lkml/kern4.log

Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---

OK, so you had an OOM which was handled by the in-kernel OOM handler
(it killed 12021), and 12037 was in the same group. The warning tells us
that it went through mem_cgroup_oom as well (otherwise it wouldn't have
memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
exited on userspace's request (via the exit syscall).

I do not see any way this could happen, though. If mem_cgroup_oom
is called then we always return CHARGE_NOMEM, which turns into ENOMEM
returned by __mem_cgroup_try_charge (invoke_oom must have been set to
true). So if nobody screwed up the return value on the way up to the page
fault handler, then there is no way to escape.

I will check the code.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-15 15:41                                                                                                                                                                   ` Michal Hocko
@ 2013-07-15 16:00                                                                                                                                                                     ` Michal Hocko
  2013-07-16 15:35                                                                                                                                                                       ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-15 16:00 UTC (permalink / raw)
  To: azurIt
  Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> On Sun 14-07-13 01:51:12, azurIt wrote:
> > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > >>> >> to associate all user's processes with target cgroup). Look here for
> > >>> >> cgroup-uid patch:
> > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > >>> >> 
> > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > >>> >> permanently '1'.
> > >>> >
> > >>> >This is really strange. Could you post the whole diff against stable
> > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > >>> >patch)?
> > >>> 
> > >>> 
> > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > >>> http://watchdog.sk/lkml/patches3/
> > >>
> > >>The two patches from Johannes seem correct.
> > >>
> > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > >>
> > >>But I cannot tell there aren't other code paths which would lead to a
> > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > >
> > >
> > >Michal,
> > >
> > >now i can definitely confirm that problem with unremovable cgroups
> > >persists. What info do you need from me? I applied also your little
> > >'WARN_ON' patch.
> > 
> > Ok, i think you want this:
> > http://watchdog.sk/lkml/kern4.log
> 
> Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
> Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
> Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
> Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
> Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
> 
> OK, so you had an OOM which has been handled by in-kernel oom handler
> (it killed 12021) and 12037 was in the same group. The warning tells us
> that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> it exited on the userspace request (by exit syscall).
> 
> I do not see any way how, this could happen though. If mem_cgroup_oom
> is called then we always return CHARGE_NOMEM which turns into ENOMEM
> returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> true).  So if nobody screwed the return value on the way up to page
> fault handler then there is no way to escape.
> 
> I will check the code.

OK, I guess I found it:
__do_fault
  fault = filemap_fault
  do_async_mmap_readahead
    page_cache_async_readahead
      ondemand_readahead
        __do_page_cache_readahead
          read_pages
            readpages = ext3_readpages
              mpage_readpages			# Doesn't propagate ENOMEM
                add_to_page_cache_lru
                  add_to_page_cache
                    add_to_page_cache_locked
                      mem_cgroup_cache_charge

So readahead, most probably. Again! Duhhh. I will try to think
about a fix for this. One obvious place is mpage_readpages, but
__do_page_cache_readahead ignores the read_pages return value as well,
and page_cache_async_readahead, even worse, returns void and is
exported as such.

So this smells like a hard-to-fix bugger. One possible, and really
ugly, way would be calling mem_cgroup_oom_synchronize even if
handle_mm_fault doesn't return VM_FAULT_ERROR, but that is a crude
hack.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-15 16:00 ` Michal Hocko
@ 2013-07-16 15:35   ` Johannes Weiner
  2013-07-16 16:09     ` Michal Hocko
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2013-07-16 15:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > >>> >> cgroup-uid patch:
> > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > >>> >> 
> > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > >>> >> permanently '1'.
> > > >>> >
> > > >>> >This is really strange. Could you post the whole diff against stable
> > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > >>> >patch)?
> > > >>> 
> > > >>> 
> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > >>> http://watchdog.sk/lkml/patches3/
> > > >>
> > > >>The two patches from Johannes seem correct.
> > > >>
> > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > >>
> > > >>But I cannot tell there aren't other code paths which would lead to a
> > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > >
> > > >
> > > >Michal,
> > > >
> > > >now i can definitely confirm that problem with unremovable cgroups
> > > >persists. What info do you need from me? I applied also your little
> > > >'WARN_ON' patch.
> > > 
> > > Ok, i think you want this:
> > > http://watchdog.sk/lkml/kern4.log
> > 
> > Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> > Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
> > Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
> > Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
> > Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
> > 
> > OK, so you had an OOM which has been handled by in-kernel oom handler
> > (it killed 12021) and 12037 was in the same group. The warning tells us
> > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > it exited on the userspace request (by exit syscall).
> > 
> > I do not see any way how, this could happen though. If mem_cgroup_oom
> > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > true).  So if nobody screwed the return value on the way up to page
> > fault handler then there is no way to escape.
> > 
> > I will check the code.
> 
> OK, I guess I found it:
> __do_fault
>   fault = filemap_fault
>   do_async_mmap_readahead
>     page_cache_async_readahead
>       ondemand_readahead
>         __do_page_cache_readahead
>           read_pages
>             readpages = ext3_readpages
>               mpage_readpages			# Doesn't propagate ENOMEM
>                add_to_page_cache_lru
>                  add_to_page_cache
>                    add_to_page_cache_locked
>                      mem_cgroup_cache_charge
> 
> So the read ahead most probably. Again! Duhhh. I will try to think
> about a fix for this. One obvious place is mpage_readpages but
> __do_page_cache_readahead ignores read_pages return value as well and
> page_cache_async_readahead, even worse, is just void and exported as
> such.
> 
> So this smells like a hard to fix bugger. One possible, and really ugly
> way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> doesn't return VM_FAULT_ERROR, but that is a crude hack.

Ouch, good spot.

I don't think we need to handle an OOM from the readahead code.  If
readahead does not produce the desired page, we retry synchronously
in page_cache_read() and handle the OOM properly.  We should not
signal an OOM for optional pages anyway.

So either we pass a flag from the readahead code down to
add_to_page_cache and mem_cgroup_cache_charge that tells the charge
code to ignore OOM conditions and not set up an OOM context.

Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages(),
with an argument that makes it only clean up the context and not wait.
It would not be completely outlandish to place it there, since it's
right next to where an error from add_to_page_cache() is not further
propagated back through the fault stack.

I'm travelling right now, I'll send a patch when I get back
(Thursday).  Unless you beat me to it :)

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-16 15:35 ` Johannes Weiner
@ 2013-07-16 16:09   ` Michal Hocko
  2013-07-16 16:48     ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Michal Hocko @ 2013-07-16 16:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
> On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > > >>> >> cgroup-uid patch:
> > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > > >>> >> 
> > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > > >>> >> permanently '1'.
> > > > >>> >
> > > > >>> >This is really strange. Could you post the whole diff against stable
> > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > > >>> >patch)?
> > > > >>> 
> > > > >>> 
> > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > > >>> http://watchdog.sk/lkml/patches3/
> > > > >>
> > > > >>The two patches from Johannes seem correct.
> > > > >>
> > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > > >>
> > > > >>But I cannot tell there aren't other code paths which would lead to a
> > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > > >
> > > > >
> > > > >Michal,
> > > > >
> > > > >now i can definitely confirm that problem with unremovable cgroups
> > > > >persists. What info do you need from me? I applied also your little
> > > > >'WARN_ON' patch.
> > > > 
> > > > Ok, i think you want this:
> > > > http://watchdog.sk/lkml/kern4.log
> > > 
> > > Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> > > Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
> > > Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
> > > Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
> > > Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
> > > Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > > Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > > Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > > Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > > Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > > Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > > Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > > Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
> > > 
> > > OK, so you had an OOM which has been handled by in-kernel oom handler
> > > (it killed 12021) and 12037 was in the same group. The warning tells us
> > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > > it exited on the userspace request (by exit syscall).
> > > 
> > > I do not see any way how, this could happen though. If mem_cgroup_oom
> > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > true).  So if nobody screwed the return value on the way up to page
> > > fault handler then there is no way to escape.
> > > 
> > > I will check the code.
> > 
> > OK, I guess I found it:
> > __do_fault
> >   fault = filemap_fault
> >   do_async_mmap_readahead
> >     page_cache_async_readahead
> >       ondemand_readahead
> >         __do_page_cache_readahead
> >           read_pages
> >             readpages = ext3_readpages
> >               mpage_readpages			# Doesn't propagate ENOMEM
> >                add_to_page_cache_lru
> >                  add_to_page_cache
> >                    add_to_page_cache_locked
> >                      mem_cgroup_cache_charge
> > 
> > So the read ahead most probably. Again! Duhhh. I will try to think
> > about a fix for this. One obvious place is mpage_readpages but
> > __do_page_cache_readahead ignores read_pages return value as well and
> > page_cache_async_readahead, even worse, is just void and exported as
> > such.
> > 
> > So this smells like a hard to fix bugger. One possible, and really ugly
> > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > doesn't return VM_FAULT_ERROR, but that is a crude hack.
> 
> Ouch, good spot.
> 
> I don't think we need to handle an OOM from the readahead code.  If
> readahead does not produce the desired page, we retry synchroneously
> in page_cache_read() and handle the OOM properly.  We should not
> signal an OOM for optional pages anyway.
> 
> So either we pass a flag from the readahead code down to
> add_to_page_cache and mem_cgroup_cache_charge that tells the charge
> code to ignore OOM conditions and do not set up an OOM context.

That was my previous attempt and it was sooo painful.

> Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages,
> with an argument that makes it only clean up the context and not wait.

Yes, I was playing with this idea as well. I just do not like how
fragile this is. We need some way to catch all possible places that
might leak it.

> It would not be completely outlandish to place it there, since it's
> right next to where an error from add_to_page_cache() is not further
> propagated back through the fault stack.
> 
> I'm travelling right now, I'll send a patch when I get back
> (Thursday).  Unless you beat me to it :)

I can cook something up but there is quite a big pile on my desk
currently (as always :/).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-16 16:09 ` Michal Hocko
@ 2013-07-16 16:48   ` Johannes Weiner
  2013-07-19  4:21     ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2013-07-16 16:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
> On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
> > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > > > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > > > >>> >> cgroup-uid patch:
> > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > > > >>> >> 
> > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > > > >>> >> permanently '1'.
> > > > > >>> >
> > > > > >>> >This is really strange. Could you post the whole diff against stable
> > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > > > >>> >patch)?
> > > > > >>> 
> > > > > >>> 
> > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > > > >>> http://watchdog.sk/lkml/patches3/
> > > > > >>
> > > > > >>The two patches from Johannes seem correct.
> > > > > >>
> > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > > > >>
> > > > > >>But I cannot tell there aren't other code paths which would lead to a
> > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > > > >
> > > > > >
> > > > > >Michal,
> > > > > >
> > > > > >now i can definitely confirm that problem with unremovable cgroups
> > > > > >persists. What info do you need from me? I applied also your little
> > > > > >'WARN_ON' patch.
> > > > > 
> > > > > Ok, i think you want this:
> > > > > http://watchdog.sk/lkml/kern4.log
> > > > 
> > > > Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> > > > Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > > Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > > Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
> > > > Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > > Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
> > > > Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > > Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
> > > > Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > > > Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > > > Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > > > Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > > > Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > > > Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > > > Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > > > Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
> > > > 
> > > > OK, so you had an OOM which has been handled by in-kernel oom handler
> > > > (it killed 12021) and 12037 was in the same group. The warning tells us
> > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > > > it exited on the userspace request (by exit syscall).
> > > > 
> > > > I do not see any way how, this could happen though. If mem_cgroup_oom
> > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > > true).  So if nobody screwed the return value on the way up to page
> > > > fault handler then there is no way to escape.
> > > > 
> > > > I will check the code.
> > > 
> > > OK, I guess I found it:
> > > __do_fault
> > >   fault = filemap_fault
> > >   do_async_mmap_readahead
> > >     page_cache_async_readahead
> > >       ondemand_readahead
> > >         __do_page_cache_readahead
> > >           read_pages
> > >             readpages = ext3_readpages
> > >               mpage_readpages			# Doesn't propagate ENOMEM
> > >                add_to_page_cache_lru
> > >                  add_to_page_cache
> > >                    add_to_page_cache_locked
> > >                      mem_cgroup_cache_charge
> > > 
> > > So the read ahead most probably. Again! Duhhh. I will try to think
> > > about a fix for this. One obvious place is mpage_readpages but
> > > __do_page_cache_readahead ignores read_pages return value as well and
> > > page_cache_async_readahead, even worse, is just void and exported as
> > > such.
> > > 
> > > So this smells like a hard to fix bugger. One possible, and really ugly
> > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > > doesn't return VM_FAULT_ERROR, but that is a crude hack.
> > 
> > Ouch, good spot.
> > 
> > I don't think we need to handle an OOM from the readahead code.  If
> > readahead does not produce the desired page, we retry synchronously
> > in page_cache_read() and handle the OOM properly.  We should not
> > signal an OOM for optional pages anyway.
> > 
> > So either we pass a flag from the readahead code down to
> > add_to_page_cache and mem_cgroup_cache_charge that tells the charge
> > code to ignore OOM conditions and do not set up an OOM context.
> 
> That was my previous attempt and it was sooo painful.
> 
> > Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages,
> > with an argument that makes it only clean up the context and not wait.
> 
> Yes, I was playing with this idea as well. I just do not like how
> fragile this is. We need some way to catch all possible places which
> might leak it.

I don't think this is necessary, but we could add a sanity check
in/near mem_cgroup_clear_userfault() that makes sure the OOM context
is only set up when an error is returned.

> > It would not be completely outlandish to place it there, since it's
> > right next to where an error from add_to_page_cache() is not further
> > propagated back through the fault stack.
> > 
> > I'm travelling right now, I'll send a patch when I get back
> > (Thursday).  Unless you beat me to it :)
> 
> I can cook something up but there is quite a big pile on my desk
> currently (as always :/).

No worries, I'll send an update.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-16 16:48                 ` Johannes Weiner
@ 2013-07-19  4:21                   ` Johannes Weiner
  2013-07-19  4:22                     ` [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers Johannes Weiner
                                         ` (5 more replies)
  0 siblings, 6 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-07-19  4:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote:
> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > > > > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > > > > >>> >> cgroup-uid patch:
> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > > > > >>> >> 
> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > > > > >>> >> permanently '1'.
> > > > > > >>> >
> > > > > > >>> >This is really strange. Could you post the whole diff against stable
> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > > > > >>> >patch)?
> > > > > > >>> 
> > > > > > >>> 
> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > > > > >>> http://watchdog.sk/lkml/patches3/
> > > > > > >>
> > > > > > >>The two patches from Johannes seem correct.
> > > > > > >>
> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > > > > >>
> > > > > > >>But I cannot tell there aren't other code paths which would lead to a
> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > > > > >
> > > > > > >
> > > > > > >Michal,
> > > > > > >
> > > > > > >now i can definitely confirm that problem with unremovable cgroups
> > > > > > >persists. What info do you need from me? I applied also your little
> > > > > > >'WARN_ON' patch.
> > > > > > 
> > > > > > Ok, i think you want this:
> > > > > > http://watchdog.sk/lkml/kern4.log
> > > > > 
> > > > > Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> > > > > Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > > > Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > > > Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
> > > > > Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > > > Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
> > > > > Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > > > Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
> > > > > Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > > > > Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > > > > Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > > > > Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > > > > Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > > > > Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > > > > Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > > > > Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
> > > > > 
> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler
> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us
> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > > > > it exited on the userspace request (by exit syscall).
> > > > > 
> > > > > I do not see how this could happen, though. If mem_cgroup_oom
> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > > > true).  So if nobody screwed the return value on the way up to page
> > > > > fault handler then there is no way to escape.
> > > > > 
> > > > > I will check the code.
> > > > 
> > > > OK, I guess I found it:
> > > > __do_fault
> > > >   fault = filemap_fault
> > > >   do_async_mmap_readahead
> > > >     page_cache_async_readahead
> > > >       ondemand_readahead
> > > >         __do_page_cache_readahead
> > > >           read_pages
> > > >             readpages = ext3_readpages
> > > >               mpage_readpages			# Doesn't propagate ENOMEM
> > > >                add_to_page_cache_lru
> > > >                  add_to_page_cache
> > > >                    add_to_page_cache_locked
> > > >                      mem_cgroup_cache_charge
> > > > 
> > > > So the read ahead most probably. Again! Duhhh. I will try to think
> > > > about a fix for this. One obvious place is mpage_readpages but
> > > > __do_page_cache_readahead ignores read_pages return value as well and
> > > > page_cache_async_readahead, even worse, is just void and exported as
> > > > such.
> > > > 
> > > > So this smells like a hard to fix bugger. One possible, and really ugly
> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack.

I fixed it by disabling the OOM killer altogether for the readahead
code.  We don't do it globally, and we should not do it in the memcg
either; these are optional allocations/charges.

I also disabled it for kernel faults triggered from within a syscall
(copy_*user, get_user_pages), which should just return -ENOMEM as
usual (unless it's nested inside a userspace fault).  The only
downside is that we can't get around annotating userspace faults
anymore, so every architecture fault handler now passes
FAULT_FLAG_USER to handle_mm_fault().  Makes the series a little less
self-contained, but it's not unreasonable.

It's easy to detect leaks now by checking whether the memcg OOM context
is set up while we are not returning VM_FAULT_OOM.

Here is a combined diff based on 3.2.  azurIt, any chance you could
give this a shot?  I tested it on my local machines, but you have a
known reproducer of fairly unlikely scenarios...

Thanks!
Johannes

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index fadd5f8..fa6b4e4 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	struct mm_struct *mm = current->mm;
 	const struct exception_table_entry *fixup;
 	int fault, si_code = SEGV_MAPERR;
+	unsigned long flags = 0;
 	siginfo_t info;
 
 	/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
@@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (cause > 0)
+		flags |= FAULT_FLAG_WRITE;
+
 	/* If for any reason at all we couldn't handle the fault,
 	   make sure we exit gracefully rather than endlessly redo
 	   the fault.  */
-	fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	up_read(&mm->mmap_sem);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index aa33949..31b1e69 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
 
 static int __kprobes
 __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-		struct task_struct *tsk)
+		struct task_struct *tsk, struct pt_regs *regs)
 {
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	int fault;
 
 	vma = find_vma(mm, addr);
@@ -253,11 +254,16 @@ good_area:
 		goto out;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (fsr & FSR_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return fault;
 	if (fault & VM_FAULT_MAJOR)
@@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 #endif
 	}
 
-	fault = __do_page_fault(mm, addr, fsr, tsk);
+	fault = __do_page_fault(mm, addr, fsr, tsk, regs);
 	up_read(&mm->mmap_sem);
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index f7040a1..ada6237 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	const struct exception_table_entry *fixup;
+	unsigned long flags = 0;
 	unsigned long address;
 	unsigned long page;
 	int writeaccess;
@@ -127,12 +128,17 @@ good_area:
 		panic("Unhandled case %lu in do_page_fault!", ecr);
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 9dcac8e..35d096a 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess & 1)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index a325d57..2dbf219 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	unsigned long _pme, lrai, lrad, fixup;
+	unsigned long flags = 0;
 	siginfo_t info;
 	pgd_t *pge;
 	pud_t *pue;
@@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 		break;
 	}
 
+	if (user_mode(__frame))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, ear0, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index c10b76f..e56baf3 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	siginfo_t info;
 	int si_code = SEGV_MAPERR;
 	int fault;
+	unsigned long flags = 0;
 	const struct exception_table_entry *fixup;
 
 	/*
@@ -96,7 +97,12 @@ good_area:
 		break;
 	}
 
-	fault = handle_mm_fault(mm, vma, address, (cause > 0));
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (cause > 0)
+		flags |= FAULT_FLAG_WRITE;
+
+	fault = handle_mm_fault(mm, vma, address, flags);
 
 	/* The most common case -- we are done. */
 	if (likely(!(fault & VM_FAULT_ERROR))) {
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 20b3593..ad9ef9d 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	int signal = SIGSEGV, code = SEGV_MAPERR;
 	struct vm_area_struct *vma, *prev_vma;
 	struct mm_struct *mm = current->mm;
+	unsigned long flags = 0;
 	struct siginfo si;
 	unsigned long mask;
 	int fault;
@@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	if ((vma->vm_flags & mask) != mask)
 		goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (mask & VM_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		/*
 		 * We ran out of memory, or some other thing happened
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 2c9aeb4..e74f6fa 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
 	unsigned long page, addr;
+	unsigned long flags = 0;
 	int write;
 	int fault;
 	siginfo_t info;
@@ -188,6 +189,11 @@ good_area:
 	if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC))
 	  goto bad_area;
 
+	if (error_code & ACE_USERMODE)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -195,7 +201,7 @@ good_area:
 	 */
 	addr = (address & PAGE_MASK);
 	set_thread_fault_code(error_code);
-	fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, addr, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 2db6099..ab88a91 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct * vma;
+	unsigned long flags = 0;
 	int write, fault;
 
 #ifdef DEBUG
@@ -134,13 +135,18 @@ good_area:
 				goto acc_err;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 #ifdef DEBUG
 	printk("handle_mm_fault returns %d\n",fault);
 #endif
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index ae97d2c..b002612 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = current->mm;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int code = SEGV_MAPERR;
 	int is_write = error_code & ESR_S;
@@ -206,12 +207,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 937cf33..e5b9fed 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 	const int field = sizeof(unsigned long) * 2;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -139,12 +140,17 @@ good_area:
 		}
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 0945409..031be56 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
+	unsigned long flags = 0;
 	struct mm_struct *mm;
 	unsigned long page;
 	siginfo_t info;
@@ -247,12 +248,17 @@ good_area:
 		break;
 	}
 
+	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
@@ -329,9 +335,10 @@ no_context:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
-	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
-		do_exit(SIGKILL);
+	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
+		pagefault_out_of_memory();
+		return;
+	}
 	goto no_context;
 
 do_sigbus:
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index a5dce82..d586119 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -153,13 +154,18 @@ good_area:
 	if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC))
 		goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write_acc)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, write_acc);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
@@ -246,10 +252,10 @@ out_of_memory:
 	__asm__ __volatile__("l.nop 1");
 
 	up_read(&mm->mmap_sem);
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 18162ce..a151e87 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	struct vm_area_struct *vma, *prev_vma;
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	unsigned long flags = 0;
 	unsigned long acc_type;
 	int fault;
 
@@ -195,13 +196,18 @@ good_area:
 	if ((vma->vm_flags & acc_type) != acc_type)
 		goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (acc_type & VM_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		/*
 		 * We hit a shared mapping outside of the file, or some
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5efe8c9..2bf339c 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int code = SEGV_MAPERR;
 	int is_write = 0, ret;
@@ -305,12 +306,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+	ret = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(ret & VM_FAULT_ERROR)) {
 		if (ret & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index a9a3018..fe6109c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access,
 	address = trans_exc_code & __FAIL_ADDR_MASK;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	flags = FAULT_FLAG_ALLOW_RETRY;
+	if (regs->psw.mask & PSW_MASK_PSTATE)
+		flags |= FAULT_FLAG_USER;
 	if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
 		flags |= FAULT_FLAG_WRITE;
 	down_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 47b600e..2ca5ae5 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 	const int field = sizeof(unsigned long) * 2;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -101,12 +102,16 @@ good_area:
 	}
 
 survive:
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
 	/*
 	* If for any reason at all we couldn't handle the fault,
 	* make sure we exit gracefully rather than endlessly redo
 	* the fault.
 	*/
-	fault = handle_mm_fault(mm, vma, address, write);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
@@ -172,10 +177,10 @@ out_of_memory:
 		down_read(&mm->mmap_sem);
 		goto survive;
 	}
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c
index 7bebd04..a61b803 100644
--- a/arch/sh/mm/fault_32.c
+++ b/arch/sh/mm/fault_32.c
@@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
+	unsigned long flags = 0;
 	int si_code;
 	int fault;
 	siginfo_t info;
@@ -195,12 +196,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c
index e3430e0..0a9d645 100644
--- a/arch/sh/mm/tlbflush_64.c
+++ b/arch/sh/mm/tlbflush_64.c
@@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess,
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
 	const struct exception_table_entry *fixup;
+	unsigned long flags = 0;
 	pte_t *pte;
 	int fault;
 
@@ -184,12 +185,17 @@ good_area:
 		}
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 8023fd7..efa3d48 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	struct vm_area_struct *vma;
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	unsigned long flags = 0;
 	unsigned int fixup;
 	unsigned long g2;
 	int from_user = !(regs->psr & PSR_PS);
@@ -285,12 +286,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (from_user)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 504c062..bc536ea 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	unsigned int insn = 0;
 	int si_code, fault_code, fault;
 	unsigned long address, mm_rss;
@@ -423,7 +424,12 @@ good_area:
 			goto bad_area;
 	}
 
-	fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0);
+	if (!(regs->tstate & TSTATE_PRIV))
+		flags |= FAULT_FLAG_USER;
+	if (fault_code & FAULT_CODE_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 25b7b90..b2a7fd5 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	unsigned long stack_offset;
+	unsigned long flags = 0;
 	int fault;
 	int si_code;
 	int is_kernel_mode;
@@ -415,12 +416,16 @@ good_area:
 	}
 
  survive:
+	if (!is_kernel_mode)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
@@ -540,10 +545,10 @@ out_of_memory:
 		down_read(&mm->mmap_sem);
 		goto survive;
 	}
-	pr_alert("VM: killing process %s\n", tsk->comm);
-	if (!is_kernel_mode)
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (is_kernel_mode)
+		goto no_context;
+	pagefault_out_of_memory();
+	return 0;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index dafc947..626a85e 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
@@ -62,10 +63,15 @@ good_area:
 	if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
 		goto out;
 
+	if (is_user)
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	do {
 		int fault;
 
-		fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+		fault = handle_mm_fault(mm, vma, address, flags);
 		if (unlikely(fault & VM_FAULT_ERROR)) {
 			if (fault & VM_FAULT_OOM) {
 				goto out_of_memory;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 283aa4b..3026943 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
 }
 
 static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-		struct task_struct *tsk)
+		   struct task_struct *tsk, struct pt_regs *regs)
 {
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	int fault;
 
 	vma = find_vma(mm, addr);
@@ -191,12 +192,16 @@ good_area:
 		goto out;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (!(fsr ^ 0x12))
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK,
-			    (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return fault;
 	if (fault & VM_FAULT_MAJOR)
@@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 #endif
 	}
 
-	fault = __do_pf(mm, addr, fsr, tsk);
+	fault = __do_pf(mm, addr, fsr, tsk, regs);
 	up_read(&mm->mmap_sem);
 
 	/*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..90248c9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address);
-		return 1;
-	}
 	if (!(fault & VM_FAULT_ERROR))
 		return 0;
 
@@ -999,8 +988,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	struct mm_struct *mm;
 	int fault;
 	int write = error_code & PF_WRITE;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
-					(write ? FAULT_FLAG_WRITE : 0);
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1160,6 +1148,11 @@ good_area:
 		return;
 	}
 
+	if (error_code & PF_USER)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index e367e30..7db9fbe 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs)
 	struct mm_struct *mm = current->mm;
 	unsigned int exccause = regs->exccause;
 	unsigned int address = regs->excvaddr;
+	unsigned long flags = 0;
 	siginfo_t info;
 
 	int is_write, is_exec;
@@ -101,11 +102,16 @@ good_area:
 		if (!(vma->vm_flags & (VM_READ | VM_WRITE)))
 			goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/* If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b92e5e7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 
+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+						   unsigned int new)
+{
+	unsigned int old;
+
+	old = p->memcg_oom.may_oom;
+	p->memcg_oom.may_oom = new;
+
+	return old;
+}
+bool mem_cgroup_oom_synchronize(void);
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+						   unsigned int new)
+{
+	return 0;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+	return false;
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_page_stat_item idx)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..a77d198 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
 #include <linux/latencytop.h>
 #include <linux/cred.h>
 #include <linux/llist.h>
+#include <linux/stacktrace.h>
 
 #include <asm/processor.h>
 
@@ -1568,6 +1569,14 @@ struct task_struct {
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
 	} memcg_batch;
+	struct memcg_oom_info {
+		unsigned int may_oom:1;
+		unsigned int in_memcg_oom:1;
+		struct stack_trace trace;
+		unsigned long trace_entries[16];
+		int wakeups;
+		struct mem_cgroup *wait_on_memcg;
+	} memcg_oom;
 #endif
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5f0a3c9..d18bd47 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	pgoff_t offset = vmf->pgoff;
+	unsigned int may_oom;
 	struct page *page;
 	pgoff_t size;
 	int ret = 0;
@@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 
 	/*
-	 * Do we have something in the page cache already?
+	 * Do we have something in the page cache already?  Either
+	 * way, try readahead, but disable the memcg OOM killer for it
+	 * as readahead is optional and no errors are propagated up
+	 * the fault stack, which does not allow proper unwinding of a
+	 * memcg OOM state.  The OOM killer is enabled while trying to
+	 * instantiate the faulting page individually below.
 	 */
 	page = find_get_page(mapping, offset);
 	if (likely(page)) {
@@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
+		may_oom = mem_cgroup_xchg_may_oom(current, 0);
 		do_async_mmap_readahead(vma, ra, file, page, offset);
+		mem_cgroup_xchg_may_oom(current, may_oom);
 	} else {
-		/* No page in the page cache at all */
+		/* No page in the page cache at all. */
+		may_oom = mem_cgroup_xchg_may_oom(current, 0);
 		do_sync_mmap_readahead(vma, ra, file, offset);
+		mem_cgroup_xchg_may_oom(current, may_oom);
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
diff --git a/mm/ksm.c b/mm/ksm.c
index 310544a..ae7e4ae 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma->vm_mm, vma, addr,
-							FAULT_FLAG_WRITE);
+					FAULT_FLAG_WRITE);
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..c47c77e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/stacktrace.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -249,6 +250,7 @@ struct mem_cgroup {
 
 	bool		oom_lock;
 	atomic_t	under_oom;
+	atomic_t	oom_wakeups;
 
 	atomic_t	refcnt;
 
@@ -1846,6 +1848,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
 
 static void memcg_wakeup_oom(struct mem_cgroup *memcg)
 {
+	atomic_inc(&memcg->oom_wakeups);
 	/* for filtering, pass "memcg" as argument. */
 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
@@ -1857,30 +1860,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 }
 
 /*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
  */
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
-	struct oom_wait_info owait;
-	bool locked, need_to_kill;
+	bool locked, need_to_kill = true;
 
-	owait.mem = memcg;
-	owait.wait.flags = 0;
-	owait.wait.func = memcg_oom_wake_function;
-	owait.wait.private = current;
-	INIT_LIST_HEAD(&owait.wait.task_list);
-	need_to_kill = true;
-	mem_cgroup_mark_under_oom(memcg);
+	if (!current->memcg_oom.may_oom)
+		return;
+
+	current->memcg_oom.in_memcg_oom = 1;
+
+	current->memcg_oom.trace.nr_entries = 0;
+	current->memcg_oom.trace.max_entries = 16;
+	current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+	current->memcg_oom.trace.skip = 1;
+	save_stack_trace(&current->memcg_oom.trace);
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
-	/*
-	 * Even if signal_pending(), we can't quit charge() loop without
-	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
-	 * under OOM is always welcomed, use TASK_KILLABLE here.
-	 */
-	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
 	if (!locked || memcg->oom_kill_disable)
 		need_to_kill = false;
 	if (locked)
@@ -1888,24 +1887,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	spin_unlock(&memcg_oom_lock);
 
 	if (need_to_kill) {
-		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask);
 	} else {
-		schedule();
-		finish_wait(&memcg_oom_waitq, &owait.wait);
+		/*
+		 * A system call can just return -ENOMEM, but if this
+		 * is a page fault and somebody else is handling the
+		 * OOM already, we need to sleep on the OOM waitqueue
+		 * for this memcg until the situation is resolved.
+		 * Which can take some time because it might be
+		 * handled by a userspace task.
+		 *
+		 * However, this is the charge context, which means
+		 * that we may sit on a large call stack and hold
+		 * various filesystem locks, the mmap_sem etc. and we
+		 * don't want the OOM handler to deadlock on them
+		 * while we sit here and wait.  Store the current OOM
+		 * context in the task_struct, then return -ENOMEM.
+		 * At the end of the page fault handler, with the
+		 * stack unwound, pagefault_out_of_memory() will check
+		 * back with us by calling
+		 * mem_cgroup_oom_synchronize(), possibly putting the
+		 * task to sleep.
+		 */
+		mem_cgroup_mark_under_oom(memcg);
+		current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups);
+		css_get(&memcg->css);
+		current->memcg_oom.wait_on_memcg = memcg;
 	}
-	spin_lock(&memcg_oom_lock);
-	if (locked)
+
+	if (locked) {
+		spin_lock(&memcg_oom_lock);
 		mem_cgroup_oom_unlock(memcg);
-	memcg_wakeup_oom(memcg);
-	spin_unlock(&memcg_oom_lock);
+		/*
+		 * Sleeping tasks might have been killed, make sure
+		 * they get scheduled so they can exit.
+		 */
+		if (need_to_kill)
+			memcg_oom_recover(memcg);
+		spin_unlock(&memcg_oom_lock);
+	}
+}
 
-	mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+	struct oom_wait_info owait;
+	struct mem_cgroup *memcg;
 
-	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+	/* OOM is global, do not handle */
+	if (!current->memcg_oom.in_memcg_oom)
 		return false;
-	/* Give chance to dying process */
-	schedule_timeout_uninterruptible(1);
+
+	/*
+	 * We invoked the OOM killer but there is a chance that a kill
+	 * did not free up any charges.  Everybody else might already
+	 * be sleeping, so restart the fault and keep the rampage
+	 * going until some charges are released.
+	 */
+	memcg = current->memcg_oom.wait_on_memcg;
+	if (!memcg)
+		goto out;
+
+	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+		goto out_put;
+
+	owait.mem = memcg;
+	owait.wait.flags = 0;
+	owait.wait.func = memcg_oom_wake_function;
+	owait.wait.private = current;
+	INIT_LIST_HEAD(&owait.wait.task_list);
+
+	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+	/* Only sleep if we didn't miss any wakeups since OOM */
+	if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+		schedule();
+	finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+	mem_cgroup_unmark_under_oom(memcg);
+	css_put(&memcg->css);
+	current->memcg_oom.wait_on_memcg = NULL;
+out:
+	current->memcg_oom.in_memcg_oom = 0;
 	return true;
 }
 
@@ -2195,11 +2256,10 @@ enum {
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
 	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
 	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
-	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				unsigned int nr_pages, bool oom_check)
+				unsigned int nr_pages, bool invoke_oom)
 {
 	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2317,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		return CHARGE_RETRY;
 
-	/* If we don't need to call oom-killer at el, return immediately */
-	if (!oom_check)
-		return CHARGE_NOMEM;
-	/* check OOM */
-	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
-		return CHARGE_OOM_DIE;
+	if (invoke_oom)
+		mem_cgroup_oom(mem_over_limit, gfp_mask);
 
-	return CHARGE_RETRY;
+	return CHARGE_NOMEM;
 }
 
 /*
@@ -2349,7 +2405,7 @@ again:
 	}
 
 	do {
-		bool oom_check;
+		bool invoke_oom = oom && !nr_oom_retries;
 
 		/* If killed, bypass charge */
 		if (fatal_signal_pending(current)) {
@@ -2357,13 +2413,7 @@ again:
 			goto bypass;
 		}
 
-		oom_check = false;
-		if (oom && !nr_oom_retries) {
-			oom_check = true;
-			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-		}
-
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
 		switch (ret) {
 		case CHARGE_OK:
 			break;
@@ -2376,16 +2426,12 @@ again:
 			css_put(&memcg->css);
 			goto nomem;
 		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom) {
+			if (!oom || invoke_oom) {
 				css_put(&memcg->css);
 				goto nomem;
 			}
-			/* If oom, we never return -ENOMEM */
 			nr_oom_retries--;
 			break;
-		case CHARGE_OOM_DIE: /* Killed by OOM Killer */
-			css_put(&memcg->css);
-			goto bypass;
 		}
 	} while (ret != CHARGE_OK);
 
diff --git a/mm/memory.c b/mm/memory.c
index 829d437..fc6d741 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/stacktrace.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3439,22 +3440,14 @@ unlock:
 /*
  * By the time we get here, we already hold the mm semaphore
  */
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			     unsigned long address, unsigned int flags)
 {
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
-	__set_current_state(TASK_RUNNING);
-
-	count_vm_event(PGFAULT);
-	mem_cgroup_count_vm_event(mm, PGFAULT);
-
-	/* do counter updates before entering really critical section. */
-	check_sync_rss_stat(current);
-
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
@@ -3503,6 +3496,39 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long address, unsigned int flags)
+{
+	int userfault = flags & FAULT_FLAG_USER;
+	int ret;
+
+	__set_current_state(TASK_RUNNING);
+
+	count_vm_event(PGFAULT);
+	mem_cgroup_count_vm_event(mm, PGFAULT);
+
+	/* do counter updates before entering really critical section. */
+	check_sync_rss_stat(current);
+
+	if (userfault)
+		WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1);
+
+	ret = __handle_mm_fault(mm, vma, address, flags);
+
+	if (userfault)
+		WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom,
+		 "Fixing unhandled memcg OOM context, set up from:\n")) {
+		print_stack_trace(&current->memcg_oom.trace, 0);
+		mem_cgroup_oom_synchronize();
+	}
+#endif
+
+	return ret;
+}
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
  */
 void pagefault_out_of_memory(void)
 {
+	if (mem_cgroup_oom_synchronize())
+		return;
 	if (try_set_system_oom()) {
 		out_of_memory(NULL, 0, 0, NULL);
 		clear_system_oom();



^ permalink raw reply related	[flat|nested] 172+ messages in thread
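[Editorial sketch, not part of the posted patch: the deferral scheme in the memcg hunks above — enable `may_oom` only around userspace faults, record a pending OOM in the task instead of sleeping in the charge context, and let `pagefault_out_of_memory()` synchronize once the stack is unwound — can be modeled in a few lines of userspace C. All names here mirror the patch but the code is an illustration only.]

```c
/* Minimal userspace model of the patch's task-local memcg OOM state. */
struct memcg_oom_info {
	unsigned int may_oom:1;		/* charge context may set up OOM state */
	unsigned int in_memcg_oom:1;	/* OOM state pending synchronization */
};

static struct memcg_oom_info current_oom;

/* Mirrors mem_cgroup_xchg_may_oom(): install a new value, return the old. */
static unsigned int xchg_may_oom(unsigned int new)
{
	unsigned int old = current_oom.may_oom;

	current_oom.may_oom = new;
	return old;
}

/* Charge path: only record an OOM situation for userspace faults;
 * never sleep here, just unwind with an error. */
static int charge(int out_of_memory)
{
	if (out_of_memory && current_oom.may_oom)
		current_oom.in_memcg_oom = 1;	/* defer to fault exit */
	return out_of_memory ? -1 : 0;		/* -1 stands in for -ENOMEM */
}

/* Fault entry point: memcg OOM handling is armed only for user faults,
 * matching the FAULT_FLAG_USER checks in handle_mm_fault(). */
static int handle_fault(int user, int out_of_memory)
{
	int ret;

	if (user)
		xchg_may_oom(1);
	ret = charge(out_of_memory);
	if (user)
		xchg_may_oom(0);
	return ret;
}

/* pagefault_out_of_memory(): consume the deferred state, if any. */
static int oom_synchronize(void)
{
	if (!current_oom.in_memcg_oom)
		return 0;	/* global OOM, not ours to handle */
	/* the real code sleeps on the memcg OOM waitqueue here */
	current_oom.in_memcg_oom = 0;
	return 1;
}
```

The point of the split is visible in the model: `charge()` never blocks while holding mmap_sem and filesystem locks; only `oom_synchronize()`, called with the fault stack unwound, may sleep.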

* [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers
  2013-07-19  4:21                 ` Johannes Weiner
@ 2013-07-19  4:22                   ` Johannes Weiner
  2013-07-19  4:24                   ` [patch 2/5] mm: pass userspace fault flag to generic fault handler Johannes Weiner
                                      ` (4 subsequent siblings)
  5 siblings, 0 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-07-19  4:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

[already upstream, included for 3.2 reference]

A few remaining architectures directly kill the page faulting task in an
out of memory situation.  This is usually not a good idea since that
task might not even use a significant amount of memory and so may not be
the optimal victim to resolve the situation.

Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there
is a hook that architecture page fault handlers are supposed to call to
invoke the OOM killer and let it pick the right task to kill.  Convert
the remaining architectures over to this hook.

To have the previous behavior of simply taking out the faulting task the
vm.oom_kill_allocating_task sysctl can be set to 1.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>   [arch/arc bits]
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/mn10300/mm/fault.c  | 7 ++++---
 arch/openrisc/mm/fault.c | 8 ++++----
 arch/score/mm/fault.c    | 8 ++++----
 arch/tile/mm/fault.c     | 8 ++++----
 4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 0945409..5ac4df5 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -329,9 +329,10 @@ no_context:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
-	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
-		do_exit(SIGKILL);
+	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
+		pagefault_out_of_memory();
+		return;
+	}
 	goto no_context;
 
 do_sigbus:
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index a5dce82..d78881c 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -246,10 +246,10 @@ out_of_memory:
 	__asm__ __volatile__("l.nop 1");
 
 	up_read(&mm->mmap_sem);
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 47b600e..6b18fb0 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -172,10 +172,10 @@ out_of_memory:
 		down_read(&mm->mmap_sem);
 		goto survive;
 	}
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 25b7b90..3312531 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -540,10 +540,10 @@ out_of_memory:
 		down_read(&mm->mmap_sem);
 		goto survive;
 	}
-	pr_alert("VM: killing process %s\n", tsk->comm);
-	if (!is_kernel_mode)
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (is_kernel_mode)
+		goto no_context;
+	pagefault_out_of_memory();
+	return 0;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 172+ messages in thread
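[Editorial sketch, not part of the posted patch: each hunk in patch 1/5 above applies the same transformation — stop killing the faulting task directly and instead hand userspace faults to `pagefault_out_of_memory()` so the OOM killer can pick a victim. The before/after control flow, condensed into illustrative userspace C:]

```c
/* Shape of the conversion in the hunks above; the enum values stand in
 * for the real actions taken by an architecture fault handler. */
enum oom_action {
	KILL_SELF,		/* old: do_exit(SIGKILL) on current */
	INVOKE_OOM_KILLER,	/* new: pagefault_out_of_memory() */
	KERNEL_NO_CONTEXT,	/* both: goto no_context (kernel fault) */
};

/* Before: the faulting task is killed unconditionally, even if it is
 * small and a poor choice of victim. */
static enum oom_action oom_path_old(int user_mode)
{
	if (user_mode)
		return KILL_SELF;
	return KERNEL_NO_CONTEXT;
}

/* After: userspace faults defer to the OOM killer's victim selection;
 * kernel-mode faults still take the no_context path. */
static enum oom_action oom_path_new(int user_mode)
{
	if (!user_mode)
		return KERNEL_NO_CONTEXT;
	return INVOKE_OOM_KILLER;
}
```

With `vm.oom_kill_allocating_task = 1`, the new path degenerates to the old behavior of taking out the faulting task, as the commit message notes.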

* [patch 2/5] mm: pass userspace fault flag to generic fault handler
  2013-07-19  4:21                 ` Johannes Weiner
  2013-07-19  4:22                   ` [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers Johannes Weiner
@ 2013-07-19  4:24                   ` Johannes Weiner
  2013-07-19  4:25                   ` [patch 3/5] x86: finish fault error path with fatal signal Johannes Weiner
                                      ` (3 subsequent siblings)
  5 siblings, 0 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-07-19  4:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

The global OOM killer is (XXX: for most architectures) only invoked
for userspace faults, not for faults from kernelspace (uaccess, gup).

Memcg OOM handling is currently invoked for all faults.  Allow it to
behave like the global case by having the architectures pass a flag to
the generic fault handler code that identifies userspace faults.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/alpha/mm/fault.c      |  8 +++++++-
 arch/arm/mm/fault.c        | 12 +++++++++---
 arch/avr32/mm/fault.c      |  8 +++++++-
 arch/cris/mm/fault.c       |  8 +++++++-
 arch/frv/mm/fault.c        |  8 +++++++-
 arch/hexagon/mm/vm_fault.c |  8 +++++++-
 arch/ia64/mm/fault.c       |  8 +++++++-
 arch/m32r/mm/fault.c       |  8 +++++++-
 arch/m68k/mm/fault.c       |  8 +++++++-
 arch/microblaze/mm/fault.c |  8 +++++++-
 arch/mips/mm/fault.c       |  8 +++++++-
 arch/mn10300/mm/fault.c    |  8 +++++++-
 arch/openrisc/mm/fault.c   |  8 +++++++-
 arch/parisc/mm/fault.c     |  8 +++++++-
 arch/powerpc/mm/fault.c    |  8 +++++++-
 arch/s390/mm/fault.c       |  2 ++
 arch/score/mm/fault.c      |  7 ++++++-
 arch/sh/mm/fault_32.c      |  8 +++++++-
 arch/sh/mm/tlbflush_64.c   |  8 +++++++-
 arch/sparc/mm/fault_32.c   |  8 +++++++-
 arch/sparc/mm/fault_64.c   |  8 +++++++-
 arch/tile/mm/fault.c       |  7 ++++++-
 arch/um/kernel/trap.c      |  8 +++++++-
 arch/unicore32/mm/fault.c  | 13 +++++++++----
 arch/x86/mm/fault.c        |  8 ++++++--
 arch/xtensa/mm/fault.c     |  8 +++++++-
 include/linux/mm.h         |  1 +
 27 files changed, 179 insertions(+), 31 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index fadd5f8..fa6b4e4 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	struct mm_struct *mm = current->mm;
 	const struct exception_table_entry *fixup;
 	int fault, si_code = SEGV_MAPERR;
+	unsigned long flags = 0;
 	siginfo_t info;
 
 	/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
@@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (cause > 0)
+		flags |= FAULT_FLAG_WRITE;
+
 	/* If for any reason at all we couldn't handle the fault,
 	   make sure we exit gracefully rather than endlessly redo
 	   the fault.  */
-	fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	up_read(&mm->mmap_sem);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index aa33949..31b1e69 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
 
 static int __kprobes
 __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-		struct task_struct *tsk)
+		struct task_struct *tsk, struct pt_regs *regs)
 {
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	int fault;
 
 	vma = find_vma(mm, addr);
@@ -253,11 +254,16 @@ good_area:
 		goto out;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (fsr & FSR_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return fault;
 	if (fault & VM_FAULT_MAJOR)
@@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 #endif
 	}
 
-	fault = __do_page_fault(mm, addr, fsr, tsk);
+	fault = __do_page_fault(mm, addr, fsr, tsk, regs);
 	up_read(&mm->mmap_sem);
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index f7040a1..ada6237 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	const struct exception_table_entry *fixup;
+	unsigned long flags = 0;
 	unsigned long address;
 	unsigned long page;
 	int writeaccess;
@@ -127,12 +128,17 @@ good_area:
 		panic("Unhandled case %lu in do_page_fault!", ecr);
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 9dcac8e..35d096a 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess & 1)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index a325d57..2dbf219 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	unsigned long _pme, lrai, lrad, fixup;
+	unsigned long flags = 0;
 	siginfo_t info;
 	pgd_t *pge;
 	pud_t *pue;
@@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 		break;
 	}
 
+	if (user_mode(__frame))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, ear0, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index c10b76f..e56baf3 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	siginfo_t info;
 	int si_code = SEGV_MAPERR;
 	int fault;
+	unsigned long flags = 0;
 	const struct exception_table_entry *fixup;
 
 	/*
@@ -96,7 +97,12 @@ good_area:
 		break;
 	}
 
-	fault = handle_mm_fault(mm, vma, address, (cause > 0));
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (cause > 0)
+		flags |= FAULT_FLAG_WRITE;
+
+	fault = handle_mm_fault(mm, vma, address, flags);
 
 	/* The most common case -- we are done. */
 	if (likely(!(fault & VM_FAULT_ERROR))) {
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 20b3593..ad9ef9d 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	int signal = SIGSEGV, code = SEGV_MAPERR;
 	struct vm_area_struct *vma, *prev_vma;
 	struct mm_struct *mm = current->mm;
+	unsigned long flags = 0;
 	struct siginfo si;
 	unsigned long mask;
 	int fault;
@@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	if ((vma->vm_flags & mask) != mask)
 		goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (mask & VM_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		/*
 		 * We ran out of memory, or some other thing happened
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 2c9aeb4..e74f6fa 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
 	unsigned long page, addr;
+	unsigned long flags = 0;
 	int write;
 	int fault;
 	siginfo_t info;
@@ -188,6 +189,11 @@ good_area:
 	if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC))
 	  goto bad_area;
 
+	if (error_code & ACE_USERMODE)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -195,7 +201,7 @@ good_area:
 	 */
 	addr = (address & PAGE_MASK);
 	set_thread_fault_code(error_code);
-	fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, addr, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 2db6099..ab88a91 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct * vma;
+	unsigned long flags = 0;
 	int write, fault;
 
 #ifdef DEBUG
@@ -134,13 +135,18 @@ good_area:
 				goto acc_err;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 #ifdef DEBUG
 	printk("handle_mm_fault returns %d\n",fault);
 #endif
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index ae97d2c..b002612 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = current->mm;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int code = SEGV_MAPERR;
 	int is_write = error_code & ESR_S;
@@ -206,12 +207,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 937cf33..e5b9fed 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 	const int field = sizeof(unsigned long) * 2;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -139,12 +140,17 @@ good_area:
 		}
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 5ac4df5..031be56 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
+	unsigned long flags = 0;
 	struct mm_struct *mm;
 	unsigned long page;
 	siginfo_t info;
@@ -247,12 +248,17 @@ good_area:
 		break;
 	}
 
+	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index d78881c..d586119 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -153,13 +154,18 @@ good_area:
 	if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC))
 		goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write_acc)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, write_acc);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 18162ce..a151e87 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	struct vm_area_struct *vma, *prev_vma;
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	unsigned long flags = 0;
 	unsigned long acc_type;
 	int fault;
 
@@ -195,13 +196,18 @@ good_area:
 	if ((vma->vm_flags & acc_type) != acc_type)
 		goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (acc_type & VM_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		/*
 		 * We hit a shared mapping outside of the file, or some
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5efe8c9..2bf339c 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int code = SEGV_MAPERR;
 	int is_write = 0, ret;
@@ -305,12 +306,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+	ret = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(ret & VM_FAULT_ERROR)) {
 		if (ret & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index a9a3018..fe6109c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access,
 	address = trans_exc_code & __FAIL_ADDR_MASK;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	flags = FAULT_FLAG_ALLOW_RETRY;
+	if (regs->psw.mask & PSW_MASK_PSTATE)
+		flags |= FAULT_FLAG_USER;
 	if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
 		flags |= FAULT_FLAG_WRITE;
 	down_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 6b18fb0..2ca5ae5 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 	const int field = sizeof(unsigned long) * 2;
+	unsigned long flags = 0;
 	siginfo_t info;
 	int fault;
 
@@ -101,12 +102,16 @@ good_area:
 	}
 
 survive:
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
 	/*
 	* If for any reason at all we couldn't handle the fault,
 	* make sure we exit gracefully rather than endlessly redo
 	* the fault.
 	*/
-	fault = handle_mm_fault(mm, vma, address, write);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c
index 7bebd04..a61b803 100644
--- a/arch/sh/mm/fault_32.c
+++ b/arch/sh/mm/fault_32.c
@@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
+	unsigned long flags = 0;
 	int si_code;
 	int fault;
 	siginfo_t info;
@@ -195,12 +196,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c
index e3430e0..0a9d645 100644
--- a/arch/sh/mm/tlbflush_64.c
+++ b/arch/sh/mm/tlbflush_64.c
@@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess,
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
 	const struct exception_table_entry *fixup;
+	unsigned long flags = 0;
 	pte_t *pte;
 	int fault;
 
@@ -184,12 +185,17 @@ good_area:
 		}
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (writeaccess)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 8023fd7..efa3d48 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	struct vm_area_struct *vma;
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	unsigned long flags = 0;
 	unsigned int fixup;
 	unsigned long g2;
 	int from_user = !(regs->psr & PSR_PS);
@@ -285,12 +286,17 @@ good_area:
 			goto bad_area;
 	}
 
+	if (from_user)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 504c062..bc536ea 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	unsigned int insn = 0;
 	int si_code, fault_code, fault;
 	unsigned long address, mm_rss;
@@ -423,7 +424,12 @@ good_area:
 			goto bad_area;
 	}
 
-	fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0);
+	if (!(regs->tstate & TSTATE_PRIV))
+		flags |= FAULT_FLAG_USER;
+	if (fault_code & FAULT_CODE_WRITE)
+		flags |= FAULT_FLAG_WRITE;
+
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 3312531..b2a7fd5 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	unsigned long stack_offset;
+	unsigned long flags = 0;
 	int fault;
 	int si_code;
 	int is_kernel_mode;
@@ -415,12 +416,16 @@ good_area:
 	}
 
  survive:
+	if (!is_kernel_mode)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, write);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index dafc947..626a85e 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
@@ -62,10 +63,15 @@ good_area:
 	if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
 		goto out;
 
+	if (is_user)
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	do {
 		int fault;
 
-		fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+		fault = handle_mm_fault(mm, vma, address, flags);
 		if (unlikely(fault & VM_FAULT_ERROR)) {
 			if (fault & VM_FAULT_OOM) {
 				goto out_of_memory;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 283aa4b..3026943 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
 }
 
 static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-		struct task_struct *tsk)
+		   struct task_struct *tsk, struct pt_regs *regs)
 {
 	struct vm_area_struct *vma;
+	unsigned long flags = 0;
 	int fault;
 
 	vma = find_vma(mm, addr);
@@ -191,12 +192,16 @@ good_area:
 		goto out;
 	}
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (!(fsr ^ 0x12))
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK,
-			    (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return fault;
 	if (fault & VM_FAULT_MAJOR)
@@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 #endif
 	}
 
-	fault = __do_pf(mm, addr, fsr, tsk);
+	fault = __do_pf(mm, addr, fsr, tsk, regs);
 	up_read(&mm->mmap_sem);
 
 	/*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..1cebabe 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	struct mm_struct *mm;
 	int fault;
 	int write = error_code & PF_WRITE;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
-					(write ? FAULT_FLAG_WRITE : 0);
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1160,6 +1159,11 @@ good_area:
 		return;
 	}
 
+	if (error_code & PF_USER)
+		flags |= FAULT_FLAG_USER;
+	if (write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index e367e30..7db9fbe 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs)
 	struct mm_struct *mm = current->mm;
 	unsigned int exccause = regs->exccause;
 	unsigned int address = regs->excvaddr;
+	unsigned long flags = 0;
 	siginfo_t info;
 
 	int is_write, is_exec;
@@ -101,11 +102,16 @@ good_area:
 		if (!(vma->vm_flags & (VM_READ | VM_WRITE)))
 			goto bad_area;
 
+	if (user_mode(regs))
+		flags |= FAULT_FLAG_USER;
+	if (is_write)
+		flags |= FAULT_FLAG_WRITE;
+
 	/* If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
-- 
1.8.3.2



* [patch 3/5] x86: finish fault error path with fatal signal
  2013-07-19  4:21 ` Johannes Weiner
  2013-07-19  4:22   ` [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers Johannes Weiner
  2013-07-19  4:24   ` [patch 2/5] mm: pass userspace fault flag to generic fault handler Johannes Weiner
@ 2013-07-19  4:25   ` Johannes Weiner
  2013-07-24 20:32     ` Johannes Weiner
  2013-07-19  4:25   ` [patch 4/5] memcg: do not trap chargers with full callstack on OOM Johannes Weiner
                      ` (2 subsequent siblings)
  5 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2013-07-19  4:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

The x86 fault handler bails out in the middle of error handling when the
task has been killed.  This is a problem for the next patch, which
relies on pagefault_out_of_memory() being called even for killed tasks,
in order to perform proper OOM state unwinding.

The early bailout is only a minor optimization anyway, so just remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/x86/mm/fault.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1cebabe..90248c9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address);
-		return 1;
-	}
 	if (!(fault & VM_FAULT_ERROR))
 		return 0;
 
-- 
1.8.3.2



* [patch 4/5] memcg: do not trap chargers with full callstack on OOM
  2013-07-19  4:21 ` Johannes Weiner
                     ` (2 preceding siblings ...)
  2013-07-19  4:25   ` [patch 3/5] x86: finish fault error path with fatal signal Johannes Weiner
@ 2013-07-19  4:25   ` Johannes Weiner
  2013-07-19  4:26   ` [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind Johannes Weiner
  2013-07-19  8:23   ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM azurIt
  5 siblings, 0 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-07-19  4:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex.  The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0              # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The OOM-handling task will retry the charge indefinitely while the
OOM-killed task cannot release any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations.  In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit.  But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/<pid>,
which tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

0. When OOMing in a system call (buffered IO and friends), do not
   invoke the OOM killer, do not sleep on a OOM waitqueue, just return
   -ENOMEM.  Userspace should be able to handle this and it prevents
   anybody from looping or waiting with locks held.

1. When OOMing in a kernel fault, do not invoke the OOM killer, do not
   sleep on the OOM waitqueue, just return -ENOMEM.  The kernel fault
   stack knows how to handle this.  If a kernel fault is nested inside
   a user fault, however, user fault handling applies:

2. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

3. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer.  In addition to the wakeup implied in
the kill signal delivery, only uncharges and limit increases, things
that actually change the memory situation, should poke the waitqueue.

Reported-by: azurIt <azurit@pobox.sk>
Debugged-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  22 +++++++
 include/linux/sched.h      |   6 ++
 mm/filemap.c               |  14 ++++-
 mm/ksm.c                   |   2 +-
 mm/memcontrol.c            | 139 +++++++++++++++++++++++++++++----------------
 mm/memory.c                |  37 ++++++++----
 mm/oom_kill.c              |   2 +
 7 files changed, 159 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b92e5e7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 
+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+						   unsigned int new)
+{
+	unsigned int old;
+
+	old = p->memcg_oom.may_oom;
+	p->memcg_oom.may_oom = new;
+
+	return old;
+}
+bool mem_cgroup_oom_synchronize(void);
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+						   unsigned int new)
+{
+	return 0;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+	return false;
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_page_stat_item idx)
 {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..7e6c9e9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,6 +1568,12 @@ struct task_struct {
 		unsigned long nr_pages;	/* uncharged usage */
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
 	} memcg_batch;
+	struct memcg_oom_info {
+		unsigned int may_oom:1;
+		unsigned int in_memcg_oom:1;
+		int wakeups;
+		struct mem_cgroup *wait_on_memcg;
+	} memcg_oom;
 #endif
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5f0a3c9..d18bd47 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	pgoff_t offset = vmf->pgoff;
+	unsigned int may_oom;
 	struct page *page;
 	pgoff_t size;
 	int ret = 0;
@@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 
 	/*
-	 * Do we have something in the page cache already?
+	 * Do we have something in the page cache already?  Either
+	 * way, try readahead, but disable the memcg OOM killer for it
+	 * as readahead is optional and no errors are propagated up
+	 * the fault stack, which does not allow proper unwinding of a
+	 * memcg OOM state.  The OOM killer is enabled while trying to
+	 * instantiate the faulting page individually below.
 	 */
 	page = find_get_page(mapping, offset);
 	if (likely(page)) {
@@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
 		 */
+		may_oom = mem_cgroup_xchg_may_oom(current, 0);
 		do_async_mmap_readahead(vma, ra, file, page, offset);
+		mem_cgroup_xchg_may_oom(current, may_oom);
 	} else {
-		/* No page in the page cache at all */
+		/* No page in the page cache at all. */
+		may_oom = mem_cgroup_xchg_may_oom(current, 0);
 		do_sync_mmap_readahead(vma, ra, file, offset);
+		mem_cgroup_xchg_may_oom(current, may_oom);
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
diff --git a/mm/ksm.c b/mm/ksm.c
index 310544a..ae7e4ae 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma->vm_mm, vma, addr,
-							FAULT_FLAG_WRITE);
+					FAULT_FLAG_WRITE);
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..99b0101 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -249,6 +249,7 @@ struct mem_cgroup {
 
 	bool		oom_lock;
 	atomic_t	under_oom;
+	atomic_t	oom_wakeups;
 
 	atomic_t	refcnt;
 
@@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
 
 static void memcg_wakeup_oom(struct mem_cgroup *memcg)
 {
+	atomic_inc(&memcg->oom_wakeups);
 	/* for filtering, pass "memcg" as argument. */
 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
@@ -1857,30 +1859,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 }
 
 /*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
  */
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
-	struct oom_wait_info owait;
-	bool locked, need_to_kill;
+	bool locked, need_to_kill = true;
 
-	owait.mem = memcg;
-	owait.wait.flags = 0;
-	owait.wait.func = memcg_oom_wake_function;
-	owait.wait.private = current;
-	INIT_LIST_HEAD(&owait.wait.task_list);
-	need_to_kill = true;
-	mem_cgroup_mark_under_oom(memcg);
+	if (!current->memcg_oom.may_oom)
+		return;
+
+	current->memcg_oom.in_memcg_oom = 1;
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
-	/*
-	 * Even if signal_pending(), we can't quit charge() loop without
-	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
-	 * under OOM is always welcomed, use TASK_KILLABLE here.
-	 */
-	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
 	if (!locked || memcg->oom_kill_disable)
 		need_to_kill = false;
 	if (locked)
@@ -1888,24 +1880,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	spin_unlock(&memcg_oom_lock);
 
 	if (need_to_kill) {
-		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask);
 	} else {
-		schedule();
-		finish_wait(&memcg_oom_waitq, &owait.wait);
+		/*
+		 * A system call can just return -ENOMEM, but if this
+		 * is a page fault and somebody else is handling the
+		 * OOM already, we need to sleep on the OOM waitqueue
+		 * for this memcg until the situation is resolved.
+		 * Which can take some time because it might be
+		 * handled by a userspace task.
+		 *
+		 * However, this is the charge context, which means
+		 * that we may sit on a large call stack and hold
+		 * various filesystem locks, the mmap_sem etc. and we
+		 * don't want the OOM handler to deadlock on them
+		 * while we sit here and wait.  Store the current OOM
+		 * context in the task_struct, then return -ENOMEM.
+		 * At the end of the page fault handler, with the
+		 * stack unwound, pagefault_out_of_memory() will check
+		 * back with us by calling
+		 * mem_cgroup_oom_synchronize(), possibly putting the
+		 * task to sleep.
+		 */
+		mem_cgroup_mark_under_oom(memcg);
+		current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups);
+		css_get(&memcg->css);
+		current->memcg_oom.wait_on_memcg = memcg;
 	}
-	spin_lock(&memcg_oom_lock);
-	if (locked)
+
+	if (locked) {
+		spin_lock(&memcg_oom_lock);
 		mem_cgroup_oom_unlock(memcg);
-	memcg_wakeup_oom(memcg);
-	spin_unlock(&memcg_oom_lock);
+		/*
+		 * Sleeping tasks might have been killed, make sure
+		 * they get scheduled so they can exit.
+		 */
+		if (need_to_kill)
+			memcg_oom_recover(memcg);
+		spin_unlock(&memcg_oom_lock);
+	}
+}
 
-	mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+	struct oom_wait_info owait;
+	struct mem_cgroup *memcg;
 
-	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+	/* OOM is global, do not handle */
+	if (!current->memcg_oom.in_memcg_oom)
 		return false;
-	/* Give chance to dying process */
-	schedule_timeout_uninterruptible(1);
+
+	/*
+	 * We invoked the OOM killer but there is a chance that a kill
+	 * did not free up any charges.  Everybody else might already
+	 * be sleeping, so restart the fault and keep the rampage
+	 * going until some charges are released.
+	 */
+	memcg = current->memcg_oom.wait_on_memcg;
+	if (!memcg)
+		goto out;
+
+	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+		goto out_put;
+
+	owait.mem = memcg;
+	owait.wait.flags = 0;
+	owait.wait.func = memcg_oom_wake_function;
+	owait.wait.private = current;
+	INIT_LIST_HEAD(&owait.wait.task_list);
+
+	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+	/* Only sleep if we didn't miss any wakeups since OOM */
+	if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+		schedule();
+	finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+	mem_cgroup_unmark_under_oom(memcg);
+	css_put(&memcg->css);
+	current->memcg_oom.wait_on_memcg = NULL;
+out:
+	current->memcg_oom.in_memcg_oom = 0;
 	return true;
 }
 
@@ -2195,11 +2249,10 @@ enum {
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
 	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
 	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
-	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				unsigned int nr_pages, bool oom_check)
+				unsigned int nr_pages, bool invoke_oom)
 {
 	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2310,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		return CHARGE_RETRY;
 
-	/* If we don't need to call oom-killer at el, return immediately */
-	if (!oom_check)
-		return CHARGE_NOMEM;
-	/* check OOM */
-	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
-		return CHARGE_OOM_DIE;
+	if (invoke_oom)
+		mem_cgroup_oom(mem_over_limit, gfp_mask);
 
-	return CHARGE_RETRY;
+	return CHARGE_NOMEM;
 }
 
 /*
@@ -2349,7 +2398,7 @@ again:
 	}
 
 	do {
-		bool oom_check;
+		bool invoke_oom = oom && !nr_oom_retries;
 
 		/* If killed, bypass charge */
 		if (fatal_signal_pending(current)) {
@@ -2357,13 +2406,7 @@ again:
 			goto bypass;
 		}
 
-		oom_check = false;
-		if (oom && !nr_oom_retries) {
-			oom_check = true;
-			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-		}
-
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
 		switch (ret) {
 		case CHARGE_OK:
 			break;
@@ -2376,16 +2419,12 @@ again:
 			css_put(&memcg->css);
 			goto nomem;
 		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom) {
+			if (!oom || invoke_oom) {
 				css_put(&memcg->css);
 				goto nomem;
 			}
-			/* If oom, we never return -ENOMEM */
 			nr_oom_retries--;
 			break;
-		case CHARGE_OOM_DIE: /* Killed by OOM Killer */
-			css_put(&memcg->css);
-			goto bypass;
 		}
 	} while (ret != CHARGE_OK);
 
diff --git a/mm/memory.c b/mm/memory.c
index 829d437..2be02b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3439,22 +3439,14 @@ unlock:
 /*
  * By the time we get here, we already hold the mm semaphore
  */
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			     unsigned long address, unsigned int flags)
 {
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
-	__set_current_state(TASK_RUNNING);
-
-	count_vm_event(PGFAULT);
-	mem_cgroup_count_vm_event(mm, PGFAULT);
-
-	/* do counter updates before entering really critical section. */
-	check_sync_rss_stat(current);
-
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
@@ -3503,6 +3495,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long address, unsigned int flags)
+{
+	int userfault = flags & FAULT_FLAG_USER;
+	int ret;
+
+	__set_current_state(TASK_RUNNING);
+
+	count_vm_event(PGFAULT);
+	mem_cgroup_count_vm_event(mm, PGFAULT);
+
+	/* do counter updates before entering really critical section. */
+	check_sync_rss_stat(current);
+
+	if (userfault)
+		WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1);
+
+	ret = __handle_mm_fault(mm, vma, address, flags);
+
+	if (userfault)
+		WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);
+
+	return ret;
+}
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
  */
 void pagefault_out_of_memory(void)
 {
+	if (mem_cgroup_oom_synchronize())
+		return;
 	if (try_set_system_oom()) {
 		out_of_memory(NULL, 0, 0, NULL);
 		clear_system_oom();
-- 
1.8.3.2
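
The two-phase OOM handling that the patch above introduces can be sketched as a small userspace model (names and return codes are illustrative only, not the kernel interfaces): the charge path merely records the OOM context in the task and fails with -ENOMEM, and only after the page fault stack has fully unwound does a synchronize step decide whether to sleep on the memcg's waitqueue or just restart the fault.

```c
#include <assert.h>
#include <stddef.h>

struct memcg { int oom_wakeups; };

struct task {
	int in_memcg_oom;            /* charge path hit a memcg OOM */
	int wakeups;                 /* wakeup count snapshotted at OOM time */
	struct memcg *wait_on_memcg; /* memcg to wait on if handled elsewhere */
};

/* Charge path: never sleeps with locks held, just records and fails. */
static int charge_oom(struct task *t, struct memcg *m, int handled_elsewhere)
{
	t->in_memcg_oom = 1;
	if (handled_elsewhere) {
		t->wakeups = m->oom_wakeups;
		t->wait_on_memcg = m;
	}
	return -1; /* -ENOMEM: unwind the fault stack */
}

/* Called from pagefault_out_of_memory() with the stack unwound:
 * returns 0 if the OOM wasn't ours, 1 if the fault should just be
 * restarted, 2 if we would have slept on the waitqueue first. */
static int oom_synchronize(struct task *t)
{
	struct memcg *m = t->wait_on_memcg;
	int slept = 0;

	if (!t->in_memcg_oom)
		return 0; /* global OOM, not handled here */
	if (m && m->oom_wakeups == t->wakeups)
		slept = 1; /* no wakeup missed since OOM: would schedule() */
	t->wait_on_memcg = NULL;
	t->in_memcg_oom = 0;
	return slept ? 2 : 1;
}
```

Note how no lock is held between charge_oom() and oom_synchronize(): that gap is exactly what lets the OOM victim make progress.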


^ permalink raw reply related	[flat|nested] 172+ messages in thread

* [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind
  2013-07-19  4:21             ` Johannes Weiner
                                 ` (3 preceding siblings ...)
  2013-07-19  4:25             ` [patch 4/5] memcg: do not trap chargers with full callstack on OOM Johannes Weiner
@ 2013-07-19  4:26             ` Johannes Weiner
  2013-07-19  8:23             ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM azurIt
  5 siblings, 0 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-07-19  4:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

Catch the cases where a memcg OOM context is set up in the failed
charge path but the fault handler is not actually returning
VM_FAULT_ERROR, which would be required to properly finalize the OOM.

Example output: the first trace shows the stack at the end of
handle_mm_fault() where an unexpected memcg OOM context is detected.
The subsequent trace is of whoever set up that OOM context.  In this
case it was the charging of readahead pages in a file fault, which
does not propagate VM_FAULT_OOM on failure and should disable OOM:

[   27.805359] WARNING: at /home/hannes/src/linux/linux/mm/memory.c:3523 handle_mm_fault+0x1fb/0x3f0()
[   27.805360] Hardware name: PowerEdge 1950
[   27.805361] Fixing unhandled memcg OOM context, set up from:
[   27.805362] Pid: 1599, comm: file Tainted: G        W    3.2.0-00005-g6d10010 #97
[   27.805363] Call Trace:
[   27.805365]  [<ffffffff8103dcea>] warn_slowpath_common+0x6a/0xa0
[   27.805367]  [<ffffffff8103dd91>] warn_slowpath_fmt+0x41/0x50
[   27.805369]  [<ffffffff810c8ffb>] handle_mm_fault+0x1fb/0x3f0
[   27.805371]  [<ffffffff81024fa0>] do_page_fault+0x140/0x4a0
[   27.805373]  [<ffffffff810cdbfb>] ? do_mmap_pgoff+0x34b/0x360
[   27.805376]  [<ffffffff813cbc6f>] page_fault+0x1f/0x30
[   27.805377] ---[ end trace 305ec584fba81649 ]---
[   27.805378]  [<ffffffff810f2418>] __mem_cgroup_try_charge+0x5c8/0x7e0
[   27.805380]  [<ffffffff810f38fc>] mem_cgroup_cache_charge+0xac/0x110
[   27.805381]  [<ffffffff810a528e>] add_to_page_cache_locked+0x3e/0x120
[   27.805383]  [<ffffffff810a5385>] add_to_page_cache_lru+0x15/0x40
[   27.805385]  [<ffffffff8112dfa3>] mpage_readpages+0xc3/0x150
[   27.805387]  [<ffffffff8115c6d8>] ext4_readpages+0x18/0x20
[   27.805388]  [<ffffffff810afbe1>] __do_page_cache_readahead+0x1c1/0x270
[   27.805390]  [<ffffffff810b023c>] ra_submit+0x1c/0x20
[   27.805392]  [<ffffffff810a5eb4>] filemap_fault+0x3f4/0x450
[   27.805394]  [<ffffffff810c4a2d>] __do_fault+0x6d/0x510
[   27.805395]  [<ffffffff810c741a>] handle_pte_fault+0x8a/0x920
[   27.805397]  [<ffffffff810c8f9c>] handle_mm_fault+0x19c/0x3f0
[   27.805398]  [<ffffffff81024fa0>] do_page_fault+0x140/0x4a0
[   27.805400]  [<ffffffff813cbc6f>] page_fault+0x1f/0x30
[   27.805401]  [<ffffffffffffffff>] 0xffffffffffffffff

Debug patch only.

Not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/sched.h | 3 +++
 mm/memcontrol.c       | 7 +++++++
 mm/memory.c           | 9 +++++++++
 3 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e6c9e9..a77d198 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
 #include <linux/latencytop.h>
 #include <linux/cred.h>
 #include <linux/llist.h>
+#include <linux/stacktrace.h>
 
 #include <asm/processor.h>
 
@@ -1571,6 +1572,8 @@ struct task_struct {
 	struct memcg_oom_info {
 		unsigned int may_oom:1;
 		unsigned int in_memcg_oom:1;
+		struct stack_trace trace;
+		unsigned long trace_entries[16];
 		int wakeups;
 		struct mem_cgroup *wait_on_memcg;
 	} memcg_oom;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99b0101..c47c77e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/stacktrace.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1870,6 +1871,12 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
 
 	current->memcg_oom.in_memcg_oom = 1;
 
+	current->memcg_oom.trace.nr_entries = 0;
+	current->memcg_oom.trace.max_entries = 16;
+	current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+	current->memcg_oom.trace.skip = 1;
+	save_stack_trace(&current->memcg_oom.trace);
+
 	/* At first, try to OOM lock hierarchy under memcg.*/
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index 2be02b7..fc6d741 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/stacktrace.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3517,6 +3518,14 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (userfault)
 		WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom,
+		 "Fixing unhandled memcg OOM context, set up from:\n")) {
+		print_stack_trace(&current->memcg_oom.trace, 0);
+		mem_cgroup_oom_synchronize();
+	}
+#endif
+
 	return ret;
 }
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-19  4:21             ` Johannes Weiner
                                 ` (4 preceding siblings ...)
  2013-07-19  4:26             ` [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind Johannes Weiner
@ 2013-07-19  8:23             ` azurIt
  5 siblings, 0 replies; 172+ messages in thread
From: azurIt @ 2013-07-19  8:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	righi.andrea

> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote:
>> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
>> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
>> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
>> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
>> > > > > On Sun 14-07-13 01:51:12, azurIt wrote:
>> > > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>> > > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
>> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
>> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for
>> > > > > > >>> >> cgroup-uid patch:
>> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>> > > > > > >>> >> 
>> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>> > > > > > >>> >> permanently '1'.
>> > > > > > >>> >
>> > > > > > >>> >This is really strange. Could you post the whole diff against stable
>> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>> > > > > > >>> >patch)?
>> > > > > > >>> 
>> > > > > > >>> 
>> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>> > > > > > >>> http://watchdog.sk/lkml/patches3/
>> > > > > > >>
>> > > > > > >>The two patches from Johannes seem correct.
>> > > > > > >>
>> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
>> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
>> > > > > > >>
>> > > > > > >>But I cannot tell there aren't other code paths which would lead to a
>> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
>> > > > > > >
>> > > > > > >
>> > > > > > >Michal,
>> > > > > > >
>> > > > > > >now i can definitely confirm that problem with unremovable cgroups
>> > > > > > >persists. What info do you need from me? I applied also your little
>> > > > > > >'WARN_ON' patch.
>> > > > > > 
>> > > > > > Ok, i think you want this:
>> > > > > > http://watchdog.sk/lkml/kern4.log
>> > > > > 
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
>> > > > > 
>> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler
>> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us
>> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
>> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
>> > > > > it exited on the userspace request (by exit syscall).
>> > > > > 
>> > > > > I do not see any way how, this could happen though. If mem_cgroup_oom
>> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
>> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
>> > > > > true).  So if nobody screwed the return value on the way up to page
>> > > > > fault handler then there is no way to escape.
>> > > > > 
>> > > > > I will check the code.
>> > > > 
>> > > > OK, I guess I found it:
>> > > > __do_fault
>> > > >   fault = filemap_fault
>> > > >   do_async_mmap_readahead
>> > > >     page_cache_async_readahead
>> > > >       ondemand_readahead
>> > > >         __do_page_cache_readahead
>> > > >           read_pages
>> > > >             readpages = ext3_readpages
>> > > >               mpage_readpages			# Doesn't propagate ENOMEM
>> > > >                add_to_page_cache_lru
>> > > >                  add_to_page_cache
>> > > >                    add_to_page_cache_locked
>> > > >                      mem_cgroup_cache_charge
>> > > > 
>> > > > So the read ahead most probably. Again! Duhhh. I will try to think
>> > > > about a fix for this. One obvious place is mpage_readpages but
>> > > > __do_page_cache_readahead ignores read_pages return value as well and
>> > > > page_cache_async_readahead, even worse, is just void and exported as
>> > > > such.
>> > > > 
>> > > > So this smells like a hard to fix bugger. One possible, and really ugly
>> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
>> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack.
>
>I fixed it by disabling the OOM killer altogether for readahead code.
>We don't do it globally, we should not do it in the memcg, these are
>optional allocations/charges.
>
>I also disabled it for kernel faults triggered from within a syscall
>(copy_*user, get_user_pages), which should just return -ENOMEM as
>usual (unless it's nested inside a userspace fault).  The only
>downside is that we can't get around annotating userspace faults
>anymore, so every architecture fault handler now passes
>FAULT_FLAG_USER to handle_mm_fault().  Makes the series a little less
>self-contained, but it's not unreasonable.
>
>It's easy to detect leaks now by checking if the memcg OOM context is
>setup and we are not returning VM_FAULT_OOM.
>
>Here is a combined diff based on 3.2.  azurIt, any chance you could
>give this a shot?  I tested it on my local machines, but you have a
>known reproducer of fairly unlikely scenarios...


I will be out of the office between 25.7. and 1.8. and I don't want to run anything which could potentially cause an outage of our services. I will test this patch after 2.8. Should I also apply the previous patches, or is this one enough? Thank you very much, Johannes.

azur
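
The leak Michal traced earlier in this thread, where mpage_readpages() swallows the charge failure, and the readahead fix can be modeled in a few lines of userspace C (a hypothetical simplification with illustrative names, not the kernel code):

```c
#include <assert.h>

/* Model of the readahead leak: a readpages-style helper discards
 * per-page charge failures, so -ENOMEM never reaches the fault
 * handler even though memcg OOM state was already set up.  The fix
 * disables the OOM path around the optional readahead work. */

static int oom_context; /* stands in for current->memcg_oom.in_memcg_oom */
static int may_oom = 1; /* stands in for current->memcg_oom.may_oom */

static int charge_page(void)
{
	if (may_oom)
		oom_context = 1; /* OOM context set up... */
	return -1;               /* ...and -ENOMEM returned */
}

/* Like mpage_readpages(): the charge result is silently dropped. */
static void readahead_leaky(void)
{
	charge_page(); /* return value ignored, error cannot propagate */
}

/* The fix: readahead is optional, so run it with the OOM path off. */
static void readahead_fixed(void)
{
	int old = may_oom;

	may_oom = 0;
	charge_page();
	may_oom = old;
}
```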

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [patch 3/5] x86: finish fault error path with fatal signal
  2013-07-19  4:25             ` [patch 3/5] x86: finish fault error path with fatal signal Johannes Weiner
@ 2013-07-24 20:32             ` Johannes Weiner
  2013-07-25 20:29               ` KOSAKI Motohiro
  0 siblings, 1 reply; 172+ messages in thread
From: Johannes Weiner @ 2013-07-24 20:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote:
> The x86 fault handler bails in the middle of error handling when the
> task has been killed.  For the next patch this is a problem, because
> it relies on pagefault_out_of_memory() being called even when the task
> has been killed, to perform proper OOM state unwinding.
> 
> This is a rather minor optimization, just remove it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  arch/x86/mm/fault.c | 11 -----------
>  1 file changed, 11 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 1cebabe..90248c9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -846,17 +846,6 @@ static noinline int
>  mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>  	       unsigned long address, unsigned int fault)
>  {
> -	/*
> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> -	 * continue pagefault.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		if (!(fault & VM_FAULT_RETRY))
> -			up_read(&current->mm->mmap_sem);
> -		if (!(error_code & PF_USER))
> -			no_context(regs, error_code, address);
> -		return 1;

This is broken but I only hit it now after testing for a while.

The patch has the right idea: in case of an OOM kill, we should
continue the fault and not abort.  What I missed is that in case of a
kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to
exit the fault and not do up_read() etc.  This introduced a locking
imbalance that would get everybody hung on mmap_sem.

I moved the retry handling outside of mm_fault_error() (come on...)
and stole some documentation from arm.  It's now a little bit more
explicit and comparable to other architectures.

I'll send an updated series, patch for reference:

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] x86: finish fault error path with fatal signal

The x86 fault handler bails in the middle of error handling when the
task has been killed.  For the next patch this is a problem, because
it relies on pagefault_out_of_memory() being called even when the task
has been killed, to perform proper OOM state unwinding.

This is a rather minor optimization that cuts short the fault handling
by a few instructions in rare cases.  Just remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/x86/mm/fault.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6d77c38..0c18beb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline int
+static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address, 0, 0);
-		return 1;
-	}
-	if (!(fault & VM_FAULT_ERROR))
-		return 0;
-
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!(error_code & PF_USER)) {
 			up_read(&current->mm->mmap_sem);
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
-			return 1;
+			return;
 		}
 
 		up_read(&current->mm->mmap_sem);
@@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		else
 			BUG();
 	}
-	return 1;
 }
 
 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1189,9 +1174,17 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
 
-	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
-		if (mm_fault_error(regs, error_code, address, fault))
-			return;
+	/*
+	 * If we need to retry but a fatal signal is pending, handle the
+	 * signal first. We do not need to release the mmap_sem because it
+	 * would already be released in __lock_page_or_retry in mm/filemap.c.
+	 */
+	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+		return;
+
+	if (unlikely(fault & VM_FAULT_ERROR)) {
+		mm_fault_error(regs, error_code, address, fault);
+		return;
 	}
 
 	/*
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [patch 3/5] x86: finish fault error path with fatal signal
  2013-07-24 20:32                                                                                                                                                                                 ` Johannes Weiner
@ 2013-07-25 20:29                                                                                                                                                                                   ` KOSAKI Motohiro
  2013-07-25 21:50                                                                                                                                                                                     ` Johannes Weiner
  0 siblings, 1 reply; 172+ messages in thread
From: KOSAKI Motohiro @ 2013-07-25 20:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, azurIt, linux-kernel, linux-mm,
	cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea,
	kosaki.motohiro

(7/24/13 4:32 PM), Johannes Weiner wrote:
> On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote:
>> The x86 fault handler bails in the middle of error handling when the
>> task has been killed.  For the next patch this is a problem, because
>> it relies on pagefault_out_of_memory() being called even when the task
>> has been killed, to perform proper OOM state unwinding.
>>
>> This is a rather minor optimization, just remove it.
>>
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> ---
>>   arch/x86/mm/fault.c | 11 -----------
>>   1 file changed, 11 deletions(-)
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 1cebabe..90248c9 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -846,17 +846,6 @@ static noinline int
>>   mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>>   	       unsigned long address, unsigned int fault)
>>   {
>> -	/*
>> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
>> -	 * continue pagefault.
>> -	 */
>> -	if (fatal_signal_pending(current)) {
>> -		if (!(fault & VM_FAULT_RETRY))
>> -			up_read(&current->mm->mmap_sem);
>> -		if (!(error_code & PF_USER))
>> -			no_context(regs, error_code, address);
>> -		return 1;
>
> This is broken but I only hit it now after testing for a while.
>
> The patch has the right idea: in case of an OOM kill, we should
> continue the fault and not abort.  What I missed is that in case of a
> kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to
> exit the fault and not do up_read() etc.  This introduced a locking
> imbalance that would get everybody hung on mmap_sem.
>
> I moved the retry handling outside of mm_fault_error() (come on...)
> and stole some documentation from arm.  It's now a little bit more
> explicit and comparable to other architectures.
>
> I'll send an updated series, patch for reference:
>
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] x86: finish fault error path with fatal signal
>
> The x86 fault handler bails in the middle of error handling when the
> task has been killed.  For the next patch this is a problem, because
> it relies on pagefault_out_of_memory() being called even when the task
> has been killed, to perform proper OOM state unwinding.
>
> This is a rather minor optimization that cuts short the fault handling
> by a few instructions in rare cases.  Just remove it.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>   arch/x86/mm/fault.c | 33 +++++++++++++--------------------
>   1 file changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 6d77c38..0c18beb 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
>   	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
>   }
>
> -static noinline int
> +static noinline void
>   mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>   	       unsigned long address, unsigned int fault)
>   {
> -	/*
> -	 * Pagefault was interrupted by SIGKILL. We have no reason to
> -	 * continue pagefault.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		if (!(fault & VM_FAULT_RETRY))
> -			up_read(&current->mm->mmap_sem);
> -		if (!(error_code & PF_USER))
> -			no_context(regs, error_code, address, 0, 0);
> -		return 1;
> -	}
> -	if (!(fault & VM_FAULT_ERROR))
> -		return 0;
> -
>   	if (fault & VM_FAULT_OOM) {
>   		/* Kernel mode? Handle exceptions or die: */
>   		if (!(error_code & PF_USER)) {
>   			up_read(&current->mm->mmap_sem);
>   			no_context(regs, error_code, address,
>   				   SIGSEGV, SEGV_MAPERR);
> -			return 1;
> +			return;
>   		}
>
>   		up_read(&current->mm->mmap_sem);
> @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>   		else
>   			BUG();
>   	}
> -	return 1;
>   }
>
>   static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> @@ -1189,9 +1174,17 @@ good_area:
>   	 */
>   	fault = handle_mm_fault(mm, vma, address, flags);
>
> -	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> -		if (mm_fault_error(regs, error_code, address, fault))
> -			return;
> +	/*
> +	 * If we need to retry but a fatal signal is pending, handle the
> +	 * signal first. We do not need to release the mmap_sem because it
> +	 * would already be released in __lock_page_or_retry in mm/filemap.c.
> +	 */
> +	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +		return;
> +
> +	if (unlikely(fault & VM_FAULT_ERROR)) {
> +		mm_fault_error(regs, error_code, address, fault);
> +		return;
>   	}

When I originally wrote the code you removed, Ingo suggested putting all rare-case code into an if (unlikely()) block. Yes, this is purely a micro-optimization, but it is not costly to maintain.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [patch 3/5] x86: finish fault error path with fatal signal
  2013-07-25 20:29                                                                                                                                                                                   ` KOSAKI Motohiro
@ 2013-07-25 21:50                                                                                                                                                                                     ` Johannes Weiner
  0 siblings, 0 replies; 172+ messages in thread
From: Johannes Weiner @ 2013-07-25 21:50 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Michal Hocko, azurIt, linux-kernel, linux-mm,
	cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea

On Thu, Jul 25, 2013 at 04:29:13PM -0400, KOSAKI Motohiro wrote:
> (7/24/13 4:32 PM), Johannes Weiner wrote:
> >@@ -1189,9 +1174,17 @@ good_area:
> >  	 */
> >  	fault = handle_mm_fault(mm, vma, address, flags);
> >
> >-	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> >-		if (mm_fault_error(regs, error_code, address, fault))
> >-			return;
> >+	/*
> >+	 * If we need to retry but a fatal signal is pending, handle the
> >+	 * signal first. We do not need to release the mmap_sem because it
> >+	 * would already be released in __lock_page_or_retry in mm/filemap.c.
> >+	 */
> >+	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> >+		return;
> >+
> >+	if (unlikely(fault & VM_FAULT_ERROR)) {
> >+		mm_fault_error(regs, error_code, address, fault);
> >+		return;
> >  	}
> 
> When I originally wrote the code you removed, Ingo suggested putting all
> rare-case code into an if (unlikely()) block. Yes, this is purely a
> micro-optimization, but it is not costly to maintain.

Fair enough, thanks for the heads up!

^ permalink raw reply	[flat|nested] 172+ messages in thread

end of thread, other threads:[~2013-07-25 21:50 UTC | newest]

Thread overview: 172+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-21 19:02 memory-cgroup bug azurIt
2012-11-22  0:26 ` Kamezawa Hiroyuki
2012-11-22  9:36   ` azurIt
2012-11-22 21:45     ` Michal Hocko
2012-11-22 15:24 ` Michal Hocko
2012-11-22 18:05   ` azurIt
2012-11-22 21:42     ` Michal Hocko
2012-11-22 22:34       ` azurIt
2012-11-23  7:40         ` Michal Hocko
2012-11-23  9:21           ` azurIt
2012-11-23  9:28             ` Michal Hocko
2012-11-23  9:44               ` azurIt
2012-11-23 10:10                 ` Michal Hocko
2012-11-23  9:34             ` Glauber Costa
2012-11-23 10:04             ` Michal Hocko
2012-11-23 14:59               ` azurIt
2012-11-25 10:17                 ` Michal Hocko
2012-11-25 12:39                   ` azurIt
2012-11-25 13:02                     ` Michal Hocko
2012-11-25 13:27                       ` azurIt
2012-11-25 13:44                         ` Michal Hocko
2012-11-25  0:10               ` azurIt
2012-11-25 12:05                 ` Michal Hocko
2012-11-25 12:36                   ` azurIt
2012-11-25 13:55                   ` Michal Hocko
2012-11-26  0:38                     ` azurIt
2012-11-26  7:57                       ` Michal Hocko
2012-11-26 13:18                       ` [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked Michal Hocko
2012-11-26 13:21                         ` [PATCH for 3.2.34] " Michal Hocko
2012-11-26 21:28                           ` azurIt
2012-11-30  1:45                           ` azurIt
2012-11-30  2:29                           ` azurIt
2012-11-30 12:45                             ` Michal Hocko
2012-11-30 12:53                               ` azurIt
2012-11-30 13:44                               ` azurIt
2012-11-30 14:44                                 ` Michal Hocko
2012-11-30 15:03                                   ` Michal Hocko
2012-11-30 15:37                                     ` Michal Hocko
2012-11-30 15:08                                   ` azurIt
2012-11-30 15:39                                     ` Michal Hocko
2012-11-30 15:59                                       ` azurIt
2012-11-30 16:19                                         ` Michal Hocko
2012-11-30 16:26                                           ` azurIt
2012-11-30 16:53                                             ` Michal Hocko
2012-11-30 20:43                                               ` azurIt
2012-12-03 15:16                                           ` Michal Hocko
2012-12-05  1:36                                             ` azurIt
2012-12-05 14:17                                               ` Michal Hocko
2012-12-06  0:29                                                 ` azurIt
2012-12-06  9:54                                                   ` Michal Hocko
2012-12-06 10:12                                                     ` azurIt
2012-12-06 17:06                                                       ` Michal Hocko
2012-12-10  1:20                                                     ` azurIt
2012-12-10  9:43                                                       ` Michal Hocko
2012-12-10 10:18                                                         ` azurIt
2012-12-10 15:52                                                           ` Michal Hocko
2012-12-10 17:18                                                             ` azurIt
2012-12-17  1:34                                                             ` azurIt
2012-12-17 16:32                                                               ` Michal Hocko
2012-12-17 18:23                                                                 ` azurIt
2012-12-17 19:55                                                                   ` Michal Hocko
2012-12-18 14:22                                                                     ` azurIt
2012-12-18 15:20                                                                       ` Michal Hocko
2012-12-24 13:25                                                                         ` azurIt
2012-12-28 16:22                                                                           ` Michal Hocko
2012-12-30  1:09                                                                             ` azurIt
2012-12-30 11:08                                                                               ` Michal Hocko
2013-01-25 15:07                                                                                 ` azurIt
2013-01-25 16:31                                                                                   ` Michal Hocko
2013-02-05 13:49                                                                                     ` Michal Hocko
2013-02-05 14:49                                                                                       ` azurIt
2013-02-05 16:09                                                                                         ` Michal Hocko
2013-02-05 16:46                                                                                           ` azurIt
2013-02-05 16:48                                                                                           ` Greg Thelen
2013-02-05 17:46                                                                                             ` Michal Hocko
2013-02-05 18:09                                                                                               ` Greg Thelen
2013-02-05 18:59                                                                                                 ` Michal Hocko
2013-02-08  4:27                                                                                                   ` Greg Thelen
2013-02-08 16:29                                                                                                     ` Michal Hocko
2013-02-08 16:40                                                                                                       ` Michal Hocko
2013-02-06  1:17                                                                                           ` azurIt
2013-02-06 14:01                                                                                             ` Michal Hocko
2013-02-06 14:22                                                                                               ` Michal Hocko
2013-02-06 16:00                                                                                                 ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set Michal Hocko
2013-02-08  5:03                                                                                                   ` azurIt
2013-02-08  9:44                                                                                                     ` Michal Hocko
2013-02-08 11:02                                                                                                       ` azurIt
2013-02-08 12:38                                                                                                         ` Michal Hocko
2013-02-08 13:56                                                                                                           ` azurIt
2013-02-08 14:47                                                                                                             ` Michal Hocko
2013-02-08 15:24                                                                                                             ` Michal Hocko
2013-02-08 15:58                                                                                                               ` azurIt
2013-02-08 17:10                                                                                                                 ` Michal Hocko
2013-02-08 21:02                                                                                                                   ` azurIt
2013-02-10 15:03                                                                                                                     ` Michal Hocko
2013-02-10 16:46                                                                                                                       ` azurIt
2013-02-11 11:22                                                                                                                         ` Michal Hocko
2013-02-22  8:23                                                                                                                           ` azurIt
2013-02-22 12:52                                                                                                                             ` Michal Hocko
2013-02-22 12:54                                                                                                                               ` azurIt
2013-02-22 13:00                                                                                                                                 ` Michal Hocko
2013-06-06 16:04                                                                                                                             ` Michal Hocko
2013-06-06 16:16                                                                                                                               ` azurIt
2013-06-07 13:11                                                                                                                                 ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM Michal Hocko
2013-06-17 10:21                                                                                                                                   ` azurIt
2013-06-19 13:26                                                                                                                                     ` Michal Hocko
2013-06-22 20:09                                                                                                                                       ` azurIt
2013-06-24 20:13                                                                                                                                         ` Johannes Weiner
2013-06-28 10:06                                                                                                                                           ` azurIt
2013-07-05 18:17                                                                                                                                             ` Johannes Weiner
2013-07-05 19:02                                                                                                                                               ` azurIt
2013-07-05 19:18                                                                                                                                                 ` Johannes Weiner
2013-07-07 23:42                                                                                                                                                   ` azurIt
2013-07-09 13:10                                                                                                                                                     ` Michal Hocko
2013-07-09 13:19                                                                                                                                                       ` azurIt
2013-07-09 13:54                                                                                                                                                         ` Michal Hocko
2013-07-10 16:25                                                                                                                                                           ` azurIt
2013-07-11  7:25                                                                                                                                                             ` Michal Hocko
2013-07-13 23:26                                                                                                                                                               ` azurIt
2013-07-13 23:51                                                                                                                                                                 ` azurIt
2013-07-15 15:41                                                                                                                                                                   ` Michal Hocko
2013-07-15 16:00                                                                                                                                                                     ` Michal Hocko
2013-07-16 15:35                                                                                                                                                                       ` Johannes Weiner
2013-07-16 16:09                                                                                                                                                                         ` Michal Hocko
2013-07-16 16:48                                                                                                                                                                           ` Johannes Weiner
2013-07-19  4:21                                                                                                                                                                             ` Johannes Weiner
2013-07-19  4:22                                                                                                                                                                               ` [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers Johannes Weiner
2013-07-19  4:24                                                                                                                                                                               ` [patch 2/5] mm: pass userspace fault flag to generic fault handler Johannes Weiner
2013-07-19  4:25                                                                                                                                                                               ` [patch 3/5] x86: finish fault error path with fatal signal Johannes Weiner
2013-07-24 20:32                                                                                                                                                                                 ` Johannes Weiner
2013-07-25 20:29                                                                                                                                                                                   ` KOSAKI Motohiro
2013-07-25 21:50                                                                                                                                                                                     ` Johannes Weiner
2013-07-19  4:25                                                                                                                                                                               ` [patch 4/5] memcg: do not trap chargers with full callstack on OOM Johannes Weiner
2013-07-19  4:26                                                                                                                                                                               ` [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind Johannes Weiner
2013-07-19  8:23                                                                                                                                                                               ` [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM azurIt
2013-07-14 17:07                                                                                                                                                   ` azurIt
2013-07-09 13:00                                                                                                                                           ` Michal Hocko
2013-07-09 13:08                                                                                                                                             ` Michal Hocko
2013-07-09 13:10                                                                                                                                               ` Michal Hocko
2013-06-24 16:48                                                                                                                                       ` azurIt
2013-02-22 12:00                                                                                                                           ` [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set azurIt
2013-02-07 11:01                                                                                               ` [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked Kamezawa Hiroyuki
2013-02-07 12:31                                                                                                 ` Michal Hocko
2013-02-08  4:16                                                                                                   ` Kamezawa Hiroyuki
2013-02-08  1:40                                                                                                 ` Kamezawa Hiroyuki
2013-02-08 16:01                                                                                                   ` Michal Hocko
2013-02-05 16:31                                                                                         ` Michal Hocko
2012-12-24 13:38                                                                         ` azurIt
2012-12-28 16:35                                                                           ` Michal Hocko
2012-11-26 17:46                         ` [PATCH -mm] " Johannes Weiner
2012-11-26 18:04                           ` Michal Hocko
2012-11-26 18:24                             ` Johannes Weiner
2012-11-26 19:03                               ` Michal Hocko
2012-11-26 19:29                                 ` Johannes Weiner
2012-11-26 20:08                                   ` Michal Hocko
2012-11-26 20:19                                     ` Johannes Weiner
2012-11-26 20:46                                       ` azurIt
2012-11-26 20:53                                         ` Johannes Weiner
2012-11-26 22:06                                       ` Michal Hocko
2012-11-27  0:05                         ` Kamezawa Hiroyuki
2012-11-27  9:54                           ` Michal Hocko
2012-11-27 19:48                           ` Johannes Weiner
2012-11-27 20:54                             ` [PATCH -v2 " Michal Hocko
2012-11-27 20:59                               ` Michal Hocko
2012-11-28 15:26                                 ` Johannes Weiner
2012-11-28 16:04                                   ` Michal Hocko
2012-11-28 16:37                                     ` Johannes Weiner
2012-11-28 16:46                                       ` Michal Hocko
2012-11-28 16:48                                         ` Michal Hocko
2012-11-28 18:44                                           ` Johannes Weiner
2012-11-28 20:20                                           ` Hugh Dickins
2012-11-29 14:05                                             ` Michal Hocko
