* memory-cgroup bug

From: azurIt @ 2012-11-21 19:02 UTC
To: linux-kernel

Hi,

I'm using memory cgroups for limiting our users, and I'm having a really strange problem when a cgroup runs into its memory limit. It's very strange because it happens only sometimes (about once per week, on a random user); out of memory is usually handled OK. This happens when the problem occurs:
- no new processes can be started in this cgroup
- current processes are frozen and taking 100% of CPU
- when I try to 'strace' any of the current processes, the whole strace freezes until the process is killed (strace cannot be terminated by CTRL-C)
- the problem can be resolved by raising the memory limit for the cgroup, or by killing a few processes inside the cgroup so some memory is freed

I also grabbed the content of /proc/<pid>/stack of a frozen process:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
[<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810ec54e>] do_wp_page+0x14e/0x800
[<ffffffff810eda34>] handle_pte_fault+0x264/0x940
[<ffffffff810ee248>] handle_mm_fault+0x138/0x260
[<ffffffff810270ed>] do_page_fault+0x13d/0x460
[<ffffffff815b53ff>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

I'm currently using kernel 3.2.34, but I've had this problem since 2.6.32.

Any ideas? Thnx.

azurIt
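Grabbing the stacks of all affected tasks at once makes freezes like the one above easier to compare afterwards. A minimal sketch, assuming the cgroup v1 layout with a per-group `tasks` file; the mock directory in the demo stands in for a real cgroup path such as the reporter's `/cgroups/<user_id>/uid`, and reading another process's `/proc/<pid>/stack` needs root on a live system:

```shell
#!/bin/sh
# Sketch: dump the kernel stack of every task listed in a cgroup's
# 'tasks' file, so frozen processes' stacks can be compared later.

dump_stacks() {
    # $1: a cgroup directory containing a 'tasks' file (cgroup v1)
    for pid in $(cat "$1/tasks"); do
        echo "=== pid $pid ==="
        cat "/proc/$pid/stack" 2>/dev/null || echo "(stack not readable)"
    done
}

# Demo against a mock cgroup directory holding this shell's own pid;
# on a real system pass the actual cgroup path instead.
mock=$(mktemp -d)
echo "$$" > "$mock/tasks"
dump_stacks "$mock"
```

Saving each run to a timestamped file gives a history to hand to the list when the freeze next occurs.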
* Re: memory-cgroup bug

From: Kamezawa Hiroyuki @ 2012-11-22 0:26 UTC
To: azurIt
Cc: linux-kernel, linux-mm

(2012/11/22 4:02), azurIt wrote:
> i'm using memory cgroup for limiting our users and having a really
> strange problem when a cgroup gets out of its memory limit. [...]
>
> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [...]
>
> I'm currently using kernel 3.2.34 but i'm having this problem since 2.6.32.
>
> Any ideas? Thnx.

Under OOM in a memcg, only one process is allowed to work, because
processes tend to use up CPU under memory shortage; the other processes
are frozen. So the problem here is the one process which uses the CPU.
IIUC, the 'frozen' threads are asleep and never use CPU. It's expected
that the oom-killer or memory reclaim can solve the problem.

What is your memcg's memory.oom_control value? And the processes'
oom_adj values? (/proc/<pid>/oom_adj, /proc/<pid>/oom_score_adj)

Thanks,
-Kame
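The values asked about here can be collected in one pass. A sketch, not a definitive tool: the mock directory in the demo stands in for a real memcg directory, and `/proc/<pid>/oom_adj` is the deprecated interface (absent on newer kernels), hence the error suppression:

```shell
#!/bin/sh
# Sketch: report memory.oom_control plus each task's oom_adj and
# oom_score_adj for one memcg directory (the path is an assumption).

report_oom_settings() {
    cg=$1
    echo "memory.oom_control:"
    cat "$cg/memory.oom_control"
    for pid in $(cat "$cg/tasks"); do
        printf 'pid %s: oom_adj=%s oom_score_adj=%s\n' "$pid" \
            "$(cat "/proc/$pid/oom_adj" 2>/dev/null)" \
            "$(cat "/proc/$pid/oom_score_adj" 2>/dev/null)"
    done
}

# Demo on a mock directory mirroring the settings reported in this
# thread; on a real box point it at the memcg mount instead.
mock=$(mktemp -d)
printf 'oom_kill_disable 0\nunder_oom 0\n' > "$mock/memory.oom_control"
echo "$$" > "$mock/tasks"
report_oom_settings "$mock"
```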
* Re: memory-cgroup bug

From: azurIt @ 2012-11-22 9:36 UTC
To: Kamezawa Hiroyuki
Cc: linux-kernel, linux-mm

> From: "Kamezawa Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
> To: azurIt <azurit@pobox.sk>
> Date: 22.11.2012 01:27
> Subject: Re: memory-cgroup bug
> Cc: linux-kernel@vger.kernel.org, "linux-mm" <linux-mm@kvack.org>
>
> Under OOM in a memcg, only one process is allowed to work, because
> processes tend to use up CPU under memory shortage; the other
> processes are frozen. [...]
>
> What is your memcg's memory.oom_control value?

oom_kill_disable 0

> And the processes' oom_adj values? (/proc/<pid>/oom_adj,
> /proc/<pid>/oom_score_adj)

When I look at a random user PID (an Apache web server):
oom_adj = 0
oom_score_adj = 0

I can also look at the data of a 'frozen' process if you need it, but I
will have to wait until the problem occurs again.

The main problem is that when this happens, it's NOT resolved
automatically by the kernel/OOM killer, and the user of the cgroup where
it happened has non-working services until I kill his processes by hand.
I'm sure that all the 'frozen' processes are taking very much CPU,
because the server load also goes really high - next time I will take a
screenshot of htop. I really wonder why OOM __sometimes__ does not
resolve this (it usually does, only sometimes not).

Thank you!

azur
* Re: memory-cgroup bug

From: Michal Hocko @ 2012-11-22 21:45 UTC
To: azurIt
Cc: Kamezawa Hiroyuki, linux-kernel, linux-mm

On Thu 22-11-12 10:36:18, azurIt wrote:
[...]
> I can also look at the data of a 'frozen' process if you need it, but
> I will have to wait until the problem occurs again.
>
> The main problem is that when this happens, it's NOT resolved
> automatically by the kernel/OOM killer, and the user of the cgroup
> where it happened has non-working services until I kill his processes
> by hand. I'm sure that all the 'frozen' processes are taking very
> much CPU, because the server load also goes really high - next time I
> will take a screenshot of htop. I really wonder why OOM __sometimes__
> does not resolve this (it usually does, only sometimes not).

What does your kernel log say while this is happening? Are there any
memcg OOM messages showing up?

--
Michal Hocko
SUSE Labs
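One quick way to answer this is to grep the log for the memcg OOM killer's output; "Memory cgroup out of memory" is the string the memcg OOM path prints when it kills a task. The log path and the saved-sample demo below are assumptions; on a live box this would run against `dmesg` output or the syslog file:

```shell
#!/bin/sh
# Sketch: count memcg OOM-killer messages in a kernel log file.

count_memcg_oom() {
    grep -ci 'memory cgroup out of memory' "$1"
}

# Demo on a saved sample line (the pid/score values are made up);
# a count of zero during a freeze would point at a stuck OOM killer.
log=$(mktemp)
printf 'Memory cgroup out of memory: Kill process 1234 (apache2) score 1000 or sacrifice child\n' > "$log"
count_memcg_oom "$log"    # → 1
```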
* Re: memory-cgroup bug

From: Michal Hocko @ 2012-11-22 15:24 UTC
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist

On Wed 21-11-12 20:02:07, azurIt wrote:
> i'm using memory cgroup for limiting our users and having a really
> strange problem when a cgroup gets out of its memory limit. It's very
> strange because it happens only sometimes (about once per week on a
> random user), out of memory is usually handled ok.

What is your memcg configuration? Do you use deeper hierarchies, is
use_hierarchy enabled? Is the memcg OOM killer (aka memory.oom_control)
enabled? Do you use soft limits for those groups? Is memcg swap
accounting enabled, and are memsw limits in place? Is the machine under
global memory pressure as well? Could you post sysrq+t or sysrq+w?

> This happens when the problem occurs:
> - no new processes can be started in this cgroup
> - current processes are frozen and taking 100% of CPU
> [...]
>
> I also grabbed the content of /proc/<pid>/stack of a frozen process:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0

Hmm, what is this?

> [<ffffffff8110ba56>] mem_cgroup_charge_common+0x56/0xa0
> [<ffffffff8110bae5>] mem_cgroup_newpage_charge+0x45/0x50
> [...]

How many tasks are hung in mem_cgroup_handle_oom? If there were many of
them, then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg: make
oom_lock 0 and 1 based rather than counter) and its follow-up fix
23751be00940 (memcg: fix hierarchical oom locking), but you are saying
that you can reproduce with 3.2 and those went in for 3.1. 2.6.32 would
make more sense.

> I'm currently using kernel 3.2.34, but I've had this problem since 2.6.32.

I guess this is a clean vanilla (stable) kernel, right? Are you able to
reproduce with the latest Linus tree?

--
Michal Hocko
SUSE Labs
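The count asked for here can be scripted. This is a sketch: on a live system the inputs would be the `/proc/*/stack` files (root only); the demo runs on saved stack dumps instead, with made-up filenames:

```shell
#!/bin/sh
# Sketch: count how many saved stack dumps show a task sitting in
# mem_cgroup_handle_oom.

count_hung() {
    # $@: files each containing one task's /proc/<pid>/stack contents
    grep -l 'mem_cgroup_handle_oom' "$@" 2>/dev/null | wc -l
}

# Demo: two saved stacks, one of them in the memcg OOM path.
tmp=$(mktemp -d)
printf '[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0\n' > "$tmp/stack.1"
printf '[<ffffffff810ee248>] handle_mm_fault+0x138/0x260\n' > "$tmp/stack.2"
count_hung "$tmp"/stack.*    # → 1
```

A count much greater than one during a freeze would match the many-waiters symptom the referenced oom_lock fixes addressed.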
* Re: memory-cgroup bug

From: azurIt @ 2012-11-22 18:05 UTC
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist

> What is your memcg configuration? Do you use deeper hierarchies, is
> use_hierarchy enabled? Is the memcg OOM killer (aka
> memory.oom_control) enabled? Do you use soft limits for those groups?
> Is memcg swap accounting enabled, and are memsw limits in place? Is
> the machine under global memory pressure as well? Could you post
> sysrq+t or sysrq+w?

My cgroups hierarchy:
/cgroups/<user_id>/uid/

where '<user_id>' is the system user id and 'uid' is just the word
'uid'. Memory limits are set in /cgroups/<user_id>/ and hierarchy is
enabled. Processes are inside /cgroups/<user_id>/uid/. I'm using hard
limits for memory and swap, BUT the system has no swap at all (it has
'only' 16 GB of real RAM). memory.oom_control is set to
'oom_kill_disable 0'. The server has enough free memory when the
problem occurs.

>> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
>
> Hmm, what is this?

I really don't know; I will get the stacks of all frozen processes next
time so we can compare them.

> How many tasks are hung in mem_cgroup_handle_oom? If there were many
> of them, then it'd smell like an issue fixed by 79dfdaccd1d5 (memcg:
> make oom_lock 0 and 1 based rather than counter) and its follow-up
> fix 23751be00940 (memcg: fix hierarchical oom locking), but you are
> saying that you can reproduce with 3.2 and those went in for 3.1.
> 2.6.32 would make more sense.

Usually a maximum of several tens of processes, but I will check next
time. I was having much worse problems on 2.6.32 - when the freezing
happened, the whole server was affected (I wasn't able to do anything
and had to wait until my scripts took care of it and killed apache, so
I don't have any detailed info). On 3.2 only the target cgroup is
affected.

> I guess this is a clean vanilla (stable) kernel, right? Are you able
> to reproduce with the latest Linus tree?

Well, no. I'm using, for example, the newest stable grsecurity patch.
I'm also using a few of Andrea Righi's cgroup subsystems, but I don't
believe these are causing the problems:
- cgroup-uid, which moves processes into cgroups based on UID
- cgroup-task, which can limit the number of tasks in a cgroup (I
  already tried disabling this one, it didn't help)
http://www.develer.com/~arighi/linux/patches/

Unfortunately I cannot just install a new and untested kernel version,
because I'm not able to reproduce this problem on demand (it happens
randomly, in a production environment).

Could it be that OOM cannot start and kill processes because there's no
free memory left in the cgroup?

Thank you!

azur
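The hierarchy described above can be sketched with the cgroup v1 interface files. Illustrative only: the uid 1001, the 1 GB limit, and the mock mount point are assumptions; on the real machine /cgroups would be a mounted memcg hierarchy whose control files the kernel provides:

```shell
#!/bin/sh
# Sketch: build the /cgroups/<user_id>/uid layout the report describes,
# with the limit set on the parent and use_hierarchy enabled there.

setup_user_cgroup() {
    root=$1; uid=$2; limit=$3
    mkdir -p "$root/$uid/uid"
    echo 1 > "$root/$uid/memory.use_hierarchy"
    echo "$limit" > "$root/$uid/memory.limit_in_bytes"
    # the box has no swap, so the memsw limit mirrors the memory limit
    echo "$limit" > "$root/$uid/memory.memsw.limit_in_bytes"
}

# Demo against a mock mount point (a plain temp directory); uid and
# limit are made-up example values.
mock=$(mktemp -d)
setup_user_cgroup "$mock" 1001 $((1024 * 1024 * 1024))
cat "$mock/1001/memory.limit_in_bytes"    # → 1073741824
```

With this layout, charges from tasks in the child `uid/` group count against the parent's limit once use_hierarchy is enabled, which matches the behaviour described in the report.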
* Re: memory-cgroup bug

From: Michal Hocko @ 2012-11-22 21:42 UTC
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist

On Thu 22-11-12 19:05:26, azurIt wrote:
[...]
> Memory limits are set in /cgroups/<user_id>/ and hierarchy is
> enabled. Processes are inside /cgroups/<user_id>/uid/. I'm using
> hard limits for memory and swap, BUT the system has no swap at all
> (it has 'only' 16 GB of real RAM). memory.oom_control is set to
> 'oom_kill_disable 0'. The server has enough free memory when the
> problem occurs.

OK, so the global reclaim shouldn't be active. This is definitely good
to know.

> >> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> >> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> >
> > Hmm, what is this?
>
> I really don't know; I will get the stacks of all frozen processes
> next time so we can compare them.

Btw. is this stack stable, or is the task bouncing in some loop? And
finally, could you post the disassembly of your version of
mem_cgroup_handle_oom, please?

> Usually a maximum of several tens of processes, but I will check next
> time. I was having much worse problems on 2.6.32 - when the freezing
> happened, the whole server was affected (I wasn't able to do anything
> and had to wait until my scripts took care of it and killed apache,
> so I don't have any detailed info).

Hmm, maybe the issue fixed by 1d65f86d (mm: preallocate page before
lock_page() at filemap COW), which was merged in 3.1.

> On 3.2 only the target cgroup is affected.
>
> Well, no. I'm using, for example, the newest stable grsecurity patch.

That shouldn't be related.

> I'm also using a few of Andrea Righi's cgroup subsystems, but I don't
> believe these are causing the problems:
> - cgroup-uid, which moves processes into cgroups based on UID
> - cgroup-task, which can limit the number of tasks in a cgroup (I
>   already tried disabling this one, it didn't help)
> http://www.develer.com/~arighi/linux/patches/

I am not familiar with those patches but I will double check.

> Unfortunately I cannot just install a new and untested kernel
> version, because I'm not able to reproduce this problem on demand (it
> happens randomly, in a production environment).

This will make it a bit harder to debug, but let's see, maybe the new
traces will help...

> Could it be that OOM cannot start and kill processes because there's
> no free memory left in the cgroup?

That shouldn't happen.

--
Michal Hocko
SUSE Labs
* Re: memory-cgroup bug 2012-11-22 21:42 ` Michal Hocko @ 2012-11-22 22:34 ` azurIt -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-22 22:34 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Btw. is this stack stable or is the task bouncing in some loop?

Not sure, will check it next time.

>And finally could you post the disassembly of your version of
>mem_cgroup_handle_oom, please?

How can i do this?

>What does your kernel log says while this is happening. Are there any
>memcg OOM messages showing up?

I will get the logs next time.

Thank you!

azur

^ permalink raw reply [flat|nested] 444+ messages in thread
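[Editor's note: the "is the stack stable" question above can be answered by snapshotting /proc/<pid>/stack repeatedly and comparing the samples. A minimal sketch follows; the helper name is made up, and reading /proc/<pid>/stack normally requires root.]

```shell
# Snapshot a kernel-stack file several times and report whether the
# call chain stays the same or keeps changing between samples.
# Usage: sample_stack /proc/<PID>/stack [samples] [interval_seconds]
sample_stack() {
    file="$1"
    samples="${2:-5}"
    interval="${3:-1}"
    prev=""
    changes=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        cur=$(cat "$file") || return 1
        # Count a change whenever two consecutive snapshots differ.
        if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
            changes=$((changes + 1))
        fi
        prev="$cur"
        i=$((i + 1))
        sleep "$interval"
    done
    if [ "$changes" -eq 0 ]; then
        echo "stack stable over $samples samples"
    else
        echo "stack changed $changes time(s) over $samples samples"
    fi
}

# Example: sample_stack /proc/12345/stack 10 1
```

A stable stack points at a task parked in one wait path; a bouncing one suggests the task is looping through the fault/charge path.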
* Re: memory-cgroup bug 2012-11-22 22:34 ` azurIt (?) @ 2012-11-23 7:40 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-23 7:40 UTC (permalink / raw)
To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Thu 22-11-12 23:34:34, azurIt wrote:
[...]
> >And finally could you post the disassembly of your version of
> >mem_cgroup_handle_oom, please?
>
> How can i do this?

Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom, or use objdump -d YOUR_VMLINUX and copy out only the mem_cgroup_handle_oom function.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 444+ messages in thread
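[Editor's note: "copy out only the mem_cgroup_handle_oom function" can be scripted. `objdump -d` prints each function as a `<name>:` header line followed by its body and a blank line, so an awk range can slice out just that function; the helper name below is made up.]

```shell
# Print one function's disassembly from an `objdump -d` listing on stdin.
# objdump delimits each function with an "<name>:" header line and a
# blank line after the body, which is what the awk rules key on.
extract_func() {
    awk -v f="$1" '
        $0 ~ "<" f ">:" { found = 1 }   # start at the function header
        found           { print }       # print everything in range
        found && /^$/   { exit }        # stop at the first blank line
    '
}

# Usage: objdump -d vmlinux | extract_func mem_cgroup_handle_oom
```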
* Re: memory-cgroup bug 2012-11-23 7:40 ` Michal Hocko (?) @ 2012-11-23 9:21 ` azurIt -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-23 9:21 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>function.

If 'YOUR_VMLINUX' is supposed to be my kernel image:

# gdb vmlinuz-3.2.34-grsec-1
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized

# objdump -d vmlinuz-3.2.34-grsec-1
objdump: vmlinuz-3.2.34-grsec-1: File format not recognized

# file vmlinuz-3.2.34-grsec-1
vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA

I'm probably doing something wrong :)

Luckily, it happened again so i have more info:
- there weren't any logs in kernel from OOM for that cgroup
- there were 16 processes in cgroup
- processes in cgroup were taking together 100% of CPU (it was allowed to use only one core, so 100% of that core)
- memory.failcnt was growing fast
- oom_control: oom_kill_disable 0, under_oom 0 (this was looping from 0 to 1)
- limit_in_bytes was set to 157286400
- content of stat (as you can see, the whole memory limit was used):
cache 0
rss 0
mapped_file 0
pgpgin 0
pgpgout 0
swap 0
pgfault 0
pgmajfault 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 157286400
hierarchical_memsw_limit 157286400
total_cache 0
total_rss 157286400
total_mapped_file 0
total_pgpgin 10326454
total_pgpgout 10288054
total_swap 0
total_pgfault 12939677
total_pgmajfault 4283
total_inactive_anon 0
total_active_anon 157286400
total_inactive_file 0
total_active_file 0
total_unevictable 0

I also grabbed oom_adj, oom_score_adj and stack of all processes, here it is: http://www.watchdog.sk/lkml/memcg-bug.tar

Notice that the stack differs for a few of the processes. The stack of each process was NOT changing over time, it stayed the same.

Btw, don't know if it matters but i have several cgroup subsystems mounted and i'm also using them (i was not activating freezer in this case; don't know if it can be activated automatically by the kernel or what, i didn't check if the cgroup was frozen but i suppose it wasn't):

none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Thank you.

azur

^ permalink raw reply [flat|nested] 444+ messages in thread
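[Editor's note: the "memory.failcnt was growing fast" and "under_oom looping from 0 to 1" observations above are easiest to correlate when sampled with timestamps. A sketch under the cgroup v1 layout described in this thread; the helper name and example path are illustrative.]

```shell
# Periodically dump the OOM-related control files of one memory cgroup
# so that under_oom transitions and memory.failcnt growth get timestamps.
# Usage: poll_memcg <cgroup_dir> [samples] [interval_seconds]
poll_memcg() {
    cg="$1"
    samples="${2:-10}"
    interval="${3:-1}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        failcnt=$(cat "$cg/memory.failcnt")
        oom=$(grep under_oom "$cg/memory.oom_control")
        # One line per sample: epoch time, fail counter, under_oom state.
        echo "$(date +%s) failcnt=$failcnt $oom"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: poll_memcg /cgroups/1000 30 1
```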
* Re: memory-cgroup bug 2012-11-23 9:21 ` azurIt @ 2012-11-23 9:28 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-23 9:28 UTC (permalink / raw)
To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
> >function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>
> # objdump -d vmlinuz-3.2.34-grsec-1

You need vmlinux, not vmlinuz...

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: memory-cgroup bug 2012-11-23 9:28 ` Michal Hocko (?) @ 2012-11-23 9:44 ` azurIt -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-23 9:44 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>
>On Fri 23-11-12 10:21:37, azurIt wrote:
>> >Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> >use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> >function.
>> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>>
>> # gdb vmlinuz-3.2.34-grsec-1
>> GNU gdb (GDB) 7.0.1-debian
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>...
>> "/root/bug/vmlinuz-3.2.34-grsec-1": not in executable format: File format not recognized
>>
>> # objdump -d vmlinuz-3.2.34-grsec-1
>
>You need vmlinux not vmlinuz...

ok, got it but still no luck:

# gdb vmlinux
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
(gdb) disassemble mem_cgroup_handle_oom
No symbol table is loaded. Use the "file" command.

# objdump -d vmlinux | grep mem_cgroup_handle_oom
<no output>

i can recompile the kernel if anything needs to be added into it.

azur

^ permalink raw reply [flat|nested] 444+ messages in thread
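[Editor's note: since the symbol never shows up in the objdump output, a quick complementary check is whether the running kernel's symbol table contains it at all. /proc/kallsyms lists each kernel symbol as "<address> <type> <name>". The helper name below is made up; the optional second argument only exists so another symbol file, e.g. System.map, can be checked instead.]

```shell
# Succeed if a symbol appears in a kernel symbol table (default:
# the running kernel's /proc/kallsyms, "<address> <type> <name>" lines).
has_symbol() {
    sym="$1"
    table="${2:-/proc/kallsyms}"
    # Field 3 is the symbol name; exit status 0 means "found".
    awk -v s="$sym" '$3 == s { found = 1 } END { exit !found }' "$table"
}

# Usage:
#   has_symbol mem_cgroup_handle_oom && echo present || echo "absent (stripped or inlined)"
```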
* Re: memory-cgroup bug 2012-11-23 9:44 ` azurIt (?) @ 2012-11-23 10:10 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-23 10:10 UTC (permalink / raw)
To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:44:23, azurIt wrote:
[...]
> # gdb vmlinux
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/bug/dddddddd/vmlinux...(no debugging symbols found)...done.
> (gdb) disassemble mem_cgroup_handle_oom
> No symbol table is loaded. Use the "file" command.
>
> # objdump -d vmlinux | grep mem_cgroup_handle_oom
> <no output>

Hmm, strange: the function is on the stack but it has been inlined? Doesn't make much sense to me.

> i can recompile the kernel if anything needs to be added into it.

If you could instrument mem_cgroup_handle_oom with some printks (before we take the memcg_oom_lock, before we schedule, and inside mem_cgroup_out_of_memory), that would help.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 444+ messages in thread
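[Editor's note: the instrumentation being requested amounts to a few printk()s at the decision points of mem_cgroup_handle_oom(). A rough sketch against a 3.2-era mm/memcontrol.c follows; it is illustrative only, not a ready-to-apply patch: the surrounding code is elided, and the exact placement and local variable names may differ in the patched (grsecurity) tree.]

```c
/* mm/memcontrol.c (3.2-era), illustrative sketch only -- not a real diff. */
static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
{
	/* ... existing setup elided ... */
	printk(KERN_INFO "memcg-oom: pid %d about to take memcg_oom_lock\n",
	       task_pid_nr(current));
	spin_lock(&memcg_oom_lock);
	/* ... lock/wait bookkeeping elided ... */
	if (need_to_kill) {
		printk(KERN_INFO "memcg-oom: pid %d calling mem_cgroup_out_of_memory\n",
		       task_pid_nr(current));
		mem_cgroup_out_of_memory(memcg, mask);
	} else {
		printk(KERN_INFO "memcg-oom: pid %d going to schedule()\n",
		       task_pid_nr(current));
		schedule();
	}
	/* ... cleanup and return elided ... */
}
```

With something like this in place, the kernel log would show which of the hung tasks ever reach the kill path and which ones park in schedule(), which is exactly what the question above is trying to distinguish.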
* Re: memory-cgroup bug
  2012-11-23  9:21     ` azurIt
@ 2012-11-23  9:34       ` Glauber Costa
  -1 siblings, 0 replies; 444+ messages in thread
From: Glauber Costa @ 2012-11-23 9:34 UTC (permalink / raw)
  To: azurIt; +Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist

On 11/23/2012 01:21 PM, azurIt wrote:
>> Either use gdb YOUR_VMLINUX and disassemble mem_cgroup_handle_oom or
>> use objdump -d YOUR_VMLINUX and copy out only mem_cgroup_handle_oom
>> function.
> If 'YOUR_VMLINUX' is supposed to be my kernel image:
>
> # gdb vmlinuz-3.2.34-grsec-1

This is vmlinuz, not vmlinux. This is the compressed image.

> # file vmlinuz-3.2.34-grsec-1
> vmlinuz-3.2.34-grsec-1: Linux kernel x86 boot executable bzImage, version 3.2.34-grsec (root@server01) #1, RO-rootFS, swap_dev 0x3, Normal VGA
>
> I'm probably doing something wrong :)

You need this:

  [glauber@straightjacket linux-glommer]$ file vmlinux
  vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV),
  statically linked,
  BuildID[sha1]=0xba936ee6b6096f9bc4c663f2a2ee0c2d2481c408, not stripped

instead of bzImage.

^ permalink raw reply	[flat|nested] 444+ messages in thread
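When the uncompressed vmlinux of a running kernel was not kept, an ELF image can often be recovered from the bzImage with the extract-vmlinux helper shipped in the kernel source tree. A sketch only: the paths are assumptions, and the recovered file still carries no symbols if the build was stripped:

```shell
#!/bin/sh
# Sketch: recover an ELF vmlinux from a compressed bzImage and verify
# the result. KSRC is a hypothetical path to the matching source tree.

is_elf() {
    # true if the file starts with the ELF magic bytes 7f 45 4c 46
    [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# KSRC=/usr/src/linux-3.2.34
# "$KSRC"/scripts/extract-vmlinux /boot/vmlinuz-3.2.34-grsec-1 > vmlinux
# is_elf vmlinux && file vmlinux
```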
* Re: memory-cgroup bug
  2012-11-23  9:21     ` azurIt
@ 2012-11-23 10:04       ` Michal Hocko
  -1 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-23 10:04 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Fri 23-11-12 10:21:37, azurIt wrote:
[...]
> It, luckily, happened again so i have more info.
>
> - there weren't any logs in kernel from OOM for that cgroup
> - there were 16 processes in the cgroup
> - processes in the cgroup were together taking 100% of CPU (it
>   was allowed to use only one core, so 100% of that core)
> - memory.failcnt was growing fast
> - oom_control:
>   oom_kill_disable 0
>   under_oom 0 (this was looping from 0 to 1)

So there was an OOM going on but no messages in the log? Really strange.
Kame already asked about oom_score_adj of the processes in the group but
it didn't look like all the processes would have oom disabled, right?

> - limit_in_bytes was set to 157286400
> - content of stat (as you can see, the whole memory limit was used):
>   cache 0
>   rss 0

This looks like a top-level group for your user.

>   mapped_file 0
>   pgpgin 0
>   pgpgout 0
>   swap 0
>   pgfault 0
>   pgmajfault 0
>   inactive_anon 0
>   active_anon 0
>   inactive_file 0
>   active_file 0
>   unevictable 0
>   hierarchical_memory_limit 157286400
>   hierarchical_memsw_limit 157286400
>   total_cache 0
>   total_rss 157286400

OK, so all the memory is anonymous and you have no swap, so OOM is the
only thing to do.

>   total_mapped_file 0
>   total_pgpgin 10326454
>   total_pgpgout 10288054
>   total_swap 0
>   total_pgfault 12939677
>   total_pgmajfault 4283
>   total_inactive_anon 0
>   total_active_anon 157286400
>   total_inactive_file 0
>   total_active_file 0
>   total_unevictable 0
>
> i also grabbed oom_adj, oom_score_adj and the stack of all processes,
> here it is:
> http://www.watchdog.sk/lkml/memcg-bug.tar

Hmm, all processes waiting for oom are stuck at the very same place:
$ grep mem_cgroup_handle_oom -r [0-9]*
30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0

We take the memcg_oom_lock spinlock twice in that function and we can
schedule in between. As none of the tasks is scheduled, this would
suggest that you are blocked at the first lock. But who got the lock
then? This is really strange.
Btw. are sysrq+t resp. sysrq+w showing the same traces as
/proc/<pid>/stack?

> Notice that stack is different for few processes.

Yes, the others are in VFS resp. ext3. ext3_write_begin looks a bit
dangerous but it grabs the page before it really starts a transaction.

> Stack for all processes was NOT changing and stayed the same.

Could you take a few snapshots over time?

> Btw, don't know if it matters but i have several cgroup subsystems
> mounted and i'm also using them (i was not activating freezer in this
> case, don't know if it can be activated automatically by the kernel or
> what,

No

> didn't check if the cgroup was frozen but i suppose it wasn't):
> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0

Do you see the same issue if only the memory controller was mounted
(resp. cpuset, which you seem to use as well from your description)?

I know you said booting into a vanilla kernel would be problematic but
could you at least rule out the cgroup patches that you have mentioned?
If you need to move a task to a group based on an uid you can use the
cgrules daemon (libcgroup1 package) for that as well.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
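The repeated stack snapshots asked for above can be automated with a small loop. A sketch only: the /cgroups mount point and tasks-file layout come from this report and may differ, and reading other users' /proc/<pid>/stack needs root:

```shell
#!/bin/sh
# Sketch: periodically snapshot /proc/<pid>/stack for every task listed
# in a cgroup's tasks file, one directory per snapshot.

snapshot_stacks() {
    # $1 = tasks file (e.g. /cgroups/<uid>/tasks), $2 = output directory
    dir="$2/$(date +%s)"
    mkdir -p "$dir" || return 1
    while read -r pid; do
        # a pid may exit between listing and read; ignore such errors
        cat "/proc/$pid/stack" > "$dir/$pid.stack" 2>/dev/null || :
    done < "$1"
}

# e.g. five snapshots, ten seconds apart:
#   for i in 1 2 3 4 5; do snapshot_stacks /cgroups/1234/tasks /tmp/snap; sleep 10; done
```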
* Re: memory-cgroup bug
  2012-11-23 10:04       ` Michal Hocko
@ 2012-11-23 14:59         ` azurIt
  -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-23 14:59 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>If you could instrument mem_cgroup_handle_oom with some printks (before
>we take the memcg_oom_lock, before we schedule and into
>mem_cgroup_out_of_memory)

If you send me a patch i can do it. I'm, unfortunately, not able to
code it myself.

>> It, luckily, happened again so i have more info.
>>
>> - there weren't any logs in kernel from OOM for that cgroup
>> - there were 16 processes in the cgroup
>> - processes in the cgroup were together taking 100% of CPU (it
>>   was allowed to use only one core, so 100% of that core)
>> - memory.failcnt was growing fast
>> - oom_control:
>>   oom_kill_disable 0
>>   under_oom 0 (this was looping from 0 to 1)
>
>So there was an OOM going on but no messages in the log? Really strange.
>Kame already asked about oom_score_adj of the processes in the group but
>it didn't look like all the processes would have oom disabled, right?

There were no messages telling that some processes were killed because
of OOM.

>> - limit_in_bytes was set to 157286400
>> - content of stat (as you can see, the whole memory limit was used):
>>   cache 0
>>   rss 0
>
>This looks like a top-level group for your user.

Yes, it was from /cgroup/<user-id>/

>>   mapped_file 0
>>   pgpgin 0
>>   pgpgout 0
>>   swap 0
>>   pgfault 0
>>   pgmajfault 0
>>   inactive_anon 0
>>   active_anon 0
>>   inactive_file 0
>>   active_file 0
>>   unevictable 0
>>   hierarchical_memory_limit 157286400
>>   hierarchical_memsw_limit 157286400
>>   total_cache 0
>>   total_rss 157286400
>
>OK, so all the memory is anonymous and you have no swap, so OOM is the
>only thing to do.

What will happen if the same situation occurs globally? No swap, every
bit of memory used. Will the kernel be able to start the OOM killer?
Maybe the same thing is happening in the cgroup - there's simply no
space left to run the OOM killer. And maybe this is why it happens so
rarely - usually there are still at least a few KBs of memory left to
start the OOM killer.

>Hmm, all processes waiting for oom are stuck at the very same place:
>$ grep mem_cgroup_handle_oom -r [0-9]*
>30858/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30859/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30860/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30892/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>30898/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>31588/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>32044/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>32358/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>6031/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>6534/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>7020/stack:[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>
>We take the memcg_oom_lock spinlock twice in that function and we can
>schedule in between. As none of the tasks is scheduled, this would
>suggest that you are blocked at the first lock. But who got the lock
>then? This is really strange.
>Btw. are sysrq+t resp. sysrq+w showing the same traces as
>/proc/<pid>/stack?

Unfortunately i'm connecting remotely to the servers (SSH).

>> Notice that stack is different for few processes.
>
>Yes, the others are in VFS resp. ext3. ext3_write_begin looks a bit
>dangerous but it grabs the page before it really starts a transaction.

Maybe these processes were throttled by cgroup-blkio at the same time
and are still keeping the lock? So the problem occurs when the cgroup
is low on memory and is doing IO beyond its limits. Only guessing and
telling my thoughts.

>> Stack for all processes was NOT changing and stayed the same.
>
>Could you take a few snapshots over time?

Will do next time, but i can't keep services frozen for a long time or
customers will be angry.

>> didn't check if the cgroup was frozen but i suppose it wasn't):
>> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0
>
>Do you see the same issue if only the memory controller was mounted
>(resp. cpuset, which you seem to use as well from your description)?

Uh, we are using all mounted subsystems :( I will be able to umount
only freezer and maybe blkio for some time. Will it help?

>I know you said booting into a vanilla kernel would be problematic but
>could you at least rule out the cgroup patches that you have mentioned?
>If you need to move a task to a group based on an uid you can use the
>cgrules daemon (libcgroup1 package) for that as well.

We are using cgroup-uid because it's MUCH MUCH MUCH more effective and
better. For example, i don't believe that cgroup-task will work with
that daemon. What will happen if cgrules isn't able to add a process
into a cgroup because of the task limit? The process will probably
continue and run outside of any cgroup, which is wrong. With
cgroup-task + cgroup-uid, such processes cannot even be started (and
this is what we need).

^ permalink raw reply	[flat|nested] 444+ messages in thread
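For comparison, the rule file used by the cgrules daemon mentioned above looks roughly like the following. This is a hedged sketch of the /etc/cgrules.conf format from the libcgroup documentation; the user names and destination paths are hypothetical, and, as argued above, the daemon classifies processes after they start rather than refusing them up front the way the cgroup-task patch does:

```
# /etc/cgrules.conf:  <user>[:<process name>]  <controllers>  <destination>
webuser1        memory,cpuset   users/webuser1/
@customers      memory          customers/
*               memory          default/
```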
* Re: memory-cgroup bug 2012-11-23 14:59 ` azurIt (?) @ 2012-11-25 10:17 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-25 10:17 UTC (permalink / raw) To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist On Fri 23-11-12 15:59:04, azurIt wrote: > >If you could instrument mem_cgroup_handle_oom with some printks (before > >we take the memcg_oom_lock, before we schedule and into > >mem_cgroup_out_of_memory) > > > If you send me patch i can do it. I'm, unfortunately, not able to code it. Inlined at the end of the email. Please note I have compile tested it. It might produce a lot of output. > >> It, luckily, happend again so i have more info. > >> > >> - there wasn't any logs in kernel from OOM for that cgroup > >> - there were 16 processes in cgroup > >> - processes in cgroup were taking togather 100% of CPU (it > >> was allowed to use only one core, so 100% of that core) > >> - memory.failcnt was groving fast > >> - oom_control: > >> oom_kill_disable 0 > >> under_oom 0 (this was looping from 0 to 1) > > > >So there was an OOM going on but no messages in the log? Really strange. > >Kame already asked about oom_score_adj of the processes in the group but > >it didn't look like all the processes would have oom disabled, right? > > > There were no messages telling that some processes were killed because of OOM. dmesg | grep "Out of memory" doesn't tell anything, right? > >> - limit_in_bytes was set to 157286400 > >> - content of stat (as you can see, the whole memory limit was used): > >> cache 0 > >> rss 0 > > > >This looks like a top-level group for your user. 
> > > Yes, it was from /cgroup/<user-id>/ > > > >> mapped_file 0 > >> pgpgin 0 > >> pgpgout 0 > >> swap 0 > >> pgfault 0 > >> pgmajfault 0 > >> inactive_anon 0 > >> active_anon 0 > >> inactive_file 0 > >> active_file 0 > >> unevictable 0 > >> hierarchical_memory_limit 157286400 > >> hierarchical_memsw_limit 157286400 > >> total_cache 0 > >> total_rss 157286400 > > > >OK, so all the memory is anonymous and you have no swap so the oom is > >the only thing to do. > > > What will happen if the same situation occurs globally? No swap, every > bit of memory used. Will kernel be able to start OOM killer? OOM killer is not a task. It doesn't allocate any memory. It just walks the process list and picks up a task with the highest score. If the global oom is not able to find any such a task (e.g. because all of them have oom disabled) the the system panics. > Maybe the same thing is happening in cgroup cgroup oom differs only in that aspect that the system doesn't panic if there is no suitable task to kill. [...] > >> Notice that stack is different for few processes. > > > >Yes others are in VFS resp ext3. ext3_write_begin looks a bit dangerous > >but it grabs the page before it really starts a transaction. > > > Maybe these processes were throttled by cgroup-blkio at the same time > and are still keeping the lock? If you are thinking about memcg_oom_lock then this is not possible because the lock is held only for short times. There is no other lock that memcg oom holds. > So the problem occurs when there are low on memory and cgroup is doing > IO out of it's limits. Only guessing and telling my thoughts. The lockup (if this is what happens) still might be related to the IO controller if the killed task cannot finish due to pending IO, though. [...] > >> didn't checked if cgroup was freezed but i suppose it wasn't): > >> none /cgroups cgroup defaults,cpuacct,cpuset,memory,freezer,task,blkio 0 0 > > > >Do you see the same issue if only memory controller was mounted (resp. 
> >cpuset which you seem to use as well from your description). > > > Uh, we are using all mounted subsystems :( I will be able to umount > only freezer and maybe blkio for some time. Will it help? Not sure about that without further data. > >I know you said booting into a vanilla kernel would be problematic but > >could you at least rule out te cgroup patches that you have mentioned? > >If you need to move a task to a group based by an uid you can use > >cgrules daemon (libcgroup1 package) for that as well. > > > We are using cgroup-uid cos it's MUCH MUCH MUCH more efective and > better. For example, i don't believe that cgroup-task will work with > that daemon. What will happen if cgrules won't be able to add process > into cgroup because of task limit? Process will probably continue and > will run outside of any cgroup which is wrong. With cgroup-task + > cgroup-uid, such processes cannot be even started (and this is what we > need). I am not familiar with cgroup-task controller so I cannot comment on that. 
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..7f26ec8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1863,6 +1863,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 {
 	struct oom_wait_info owait;
 	bool locked, need_to_kill;
+	int ret = false;
 
 	owait.mem = memcg;
 	owait.wait.flags = 0;
@@ -1873,6 +1874,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_mark_under_oom(memcg);
 
 	/* At first, try to OOM lock hierarchy under memcg.*/
+	printk("XXX: %d waiting for memcg_oom_lock\n", current->pid);
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
 	/*
@@ -1887,12 +1889,14 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 		mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
+	printk("XXX: %d need_to_kill:%d locked:%d\n", current->pid, need_to_kill, locked);
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask);
 	} else {
 		schedule();
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+		printk("XXX: %d woken up\n", current->pid);
 	}
 	spin_lock(&memcg_oom_lock);
 	if (locked)
@@ -1903,10 +1907,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	mem_cgroup_unmark_under_oom(memcg);
 
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
-		return false;
+		goto out;
 	/* Give chance to dying process */
 	schedule_timeout_uninterruptible(1);
-	return true;
+	ret = true;
+out:
+	printk("XXX: %d done with %d\n", current->pid, ret);
+	return ret;
 }
 
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..a7db813 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -568,6 +568,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 */
 	if (fatal_signal_pending(current)) {
 		set_thread_flag(TIF_MEMDIE);
+		printk("XXX: %d skipping task with fatal signal pending\n", current->pid);
 		return;
 	}
 
@@ -576,8 +577,10 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, limit, mem, NULL);
-	if (!p || PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL) {
+		printk("XXX: %d nothing to kill\n", current->pid);
 		goto out;
+	}
 
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, NULL,
 				"Memory cgroup out of memory"))
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 444+ messages in thread
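[Editor's note: the instrumentation patch only emits "XXX:" lines into the kernel log. If it were ever deployed, the interesting question would be which PIDs enter mem_cgroup_handle_oom but never log completion. A rough post-processing sketch in bash; the log lines below are fabricated sample data that mirror the patch's printk formats, not real dmesg output.]

```shell
#!/bin/bash
# Sketch: list PIDs that printed "waiting for memcg_oom_lock" but never
# printed "done with" -- candidates stuck inside mem_cgroup_handle_oom.
log=$(mktemp)
cat > "$log" <<'EOF'
XXX: 24495 waiting for memcg_oom_lock
XXX: 24495 need_to_kill:0 locked:0
XXX: 24796 waiting for memcg_oom_lock
XXX: 24796 need_to_kill:1 locked:1
XXX: 24796 done with 1
EOF
# PIDs that entered the OOM path vs. PIDs that left it
started=$(awk '/waiting for memcg_oom_lock/ {print $2}' "$log" | sort -u)
finished=$(awk '/done with/ {print $2}' "$log" | sort -u)
# PIDs present in $started but absent from $finished
stuck=$(comm -23 <(printf '%s\n' $started) <(printf '%s\n' $finished))
echo "possibly stuck: $stuck"
rm -f "$log"
```

On the sample data this reports pid 24495 as never having completed the OOM path.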
* Re: memory-cgroup bug
  2012-11-25 10:17 ` Michal Hocko
@ 2012-11-25 12:39   ` azurIt
  0 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-25 12:39 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Inlined at the end of the email. Please note I have compile tested
>it. It might produce a lot of output.

Thank you very much, i will install it ASAP (probably this night).

>dmesg | grep "Out of memory"
>doesn't tell anything, right?

Only messages for other cgroups, but not for the frozen one (neither
before nor after the freeze).

azur

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-25 12:39 ` azurIt
@ 2012-11-25 13:02   ` Michal Hocko
  0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-25 13:02 UTC (permalink / raw)
To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Sun 25-11-12 13:39:53, azurIt wrote:
> >Inlined at the end of the email. Please note I have compile tested
> >it. It might produce a lot of output.
> 
> Thank you very much, i will install it ASAP (probably this night).

Please don't. If my analysis is correct, which I am almost 100% sure it
is, then it would cause excessive logging. I am sorry I cannot come up
with something else in the mean time.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-25 13:02 ` Michal Hocko
@ 2012-11-25 13:27   ` azurIt
  0 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-25 13:27 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>> Thank you very much, i will install it ASAP (probably this night).
>
>Please don't. If my analysis is correct which I am almost 100% sure it
>is then it would cause excessive logging. I am sorry I cannot come up
>with something else in the mean time.

Ok then. I will, meanwhile, try to contact Andrea Righi (author of
cgroup-task etc.) and ask him to send here his opinion about the
relation between the freezes and his patches. Maybe it's some kind of a
bug in memcg which doesn't appear in the current vanilla code and is
triggered by conditions created by, for example, cgroup-task. I noticed
that the number of frozen processes always exactly matches the task
limit set by cgroup-task (i already tried to raise this limit AFTER the
cgroup froze; it didn't change anything). I'm sure it's not a problem
with cgroup-task alone, it's 100% related also to memcg (but maybe the
combination of both of them is needed).

Thank you so far for your time!

azur

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-25 13:27 ` azurIt
@ 2012-11-25 13:44   ` Michal Hocko
  0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-25 13:44 UTC (permalink / raw)
To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist

On Sun 25-11-12 14:27:09, azurIt wrote:
> >> Thank you very much, i will install it ASAP (probably this night).
> >
> >Please don't. If my analysis is correct which I am almost 100% sure it
> >is then it would cause excessive logging. I am sorry I cannot come up
> >with something else in the mean time.
> 
> Ok then. I will, meanwhile, try to contact Andrea Righi (author of
> cgroup-task etc.) and ask him to send here his opinion about relation
> between freezes and his patches.

As I described in the other email, this seems to be a deadlock in the
memcg OOM path, so I do not think that the other patches influence this.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-23 10:04 ` Michal Hocko
@ 2012-11-25  0:10   ` azurIt
  0 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-25 0:10 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups mailinglist

>Could you take few snapshots over time?

Here it is, now from a different server; a snapshot was taken every
second for 10 minutes (hope it's enough):
www.watchdog.sk/lkml/memcg-bug-2.tar.gz

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-25  0:10 ` azurIt
@ 2012-11-25 12:05   ` Michal Hocko
  0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-25 12:05 UTC (permalink / raw)
To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

[Adding Kamezawa into CC]

On Sun 25-11-12 01:10:47, azurIt wrote:
> >Could you take few snapshots over time?
> 
> Here it is, now from different server, snapshot was taken every second
> for 10 minutes (hope it's enough):
> www.watchdog.sk/lkml/memcg-bug-2.tar.gz

Hmm, interesting:
$ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
min:16281 max:224048 avg:18818.943119

So there are a lot of charge attempts which fail, every second! Will get
to that later.

The number of tasks in the group is stable (20):
$ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
    546 20

And no task has been killed or spawned:
$ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
24495
24762
24774
24796
24798
24805
24813
24827
24831
24841
24842
24863
24892
24924
24931
25130
25131
25192
25193
25243

$ for stack in [0-9]*/[0-9]*
do
	head -n1 $stack/stack
done | sort | uniq -c
   9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
    546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
    533 [<ffffffffffffffff>] 0xffffffffffffffff

Tells us that the stacks are pretty much stable.
$ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
    546 24495

So 24495 is stuck in do_truncate:
[<ffffffff811109b8>] do_truncate+0x58/0xa0
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

I suspect it is waiting for i_mutex. Who is holding that lock?
The other tasks are blocked in mem_cgroup_handle_oom, either coming from
the page fault path, so i_mutex can be excluded, or from vfs_write
(24796), and that one is interesting:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0	# takes &inode->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This smells like a deadlock. But a kind of strange one. The rapidly
increasing failcnt suggests that somebody still tries to allocate, but
who, when all of them hang in mem_cgroup_handle_oom?

This can be explained, though. The memcg OOM killer lets only one
process (the one which is able to lock the hierarchy by
mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and kill a process,
while the others are waiting on the wait queue. Once the killer is done
it calls memcg_wakeup_oom, which wakes up the other tasks waiting on the
queue. Those retry the charge in the hope that some memory was freed in
the meantime, which hasn't happened, so they get into OOM again (and
again and again).

This all usually works out, except in this particular case. I would bet
my hat that the OOM-selected task is pid 24495, which is blocked on the
mutex held by one of the OOM killer tasks, so it cannot finish - and
thus cannot free memory.

It seems that the current Linus' tree is affected as well.

I will have to think about a solution, but it sounds really tricky. It
is not just ext3 that is affected. I guess we need to tell
mem_cgroup_cache_charge that it should never reach OOM from
add_to_page_cache_locked. This sounds quite intrusive to me. On the
other hand it is really weird that an excessive writer might trigger a
memcg OOM killer.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
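[Editor's note: the cycle described above (one task holding i_mutex while waiting for a memcg charge, the OOM-selected victim waiting for that same i_mutex) can be checked mechanically against a snapshot directory laid out as <pid>/stack, like the ones in the memcg-bug-2.tar.gz dump. A rough bash sketch; the two stacks below are fabricated test data for illustration, not taken from the real dump.]

```shell
#!/bin/bash
# Sketch: scan a <pid>/stack snapshot directory for the two signature
# stacks of the suspected deadlock. Fabricated sample data follows.
snap=$(mktemp -d)
mkdir -p "$snap/24495" "$snap/24796"
printf '%s\n' \
  '[<ffffffff811109b8>] do_truncate+0x58/0xa0' \
  '[<ffffffff81121c90>] do_last+0x250/0xa30' > "$snap/24495/stack"
printf '%s\n' \
  '[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0' \
  '[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0' > "$snap/24796/stack"

victim="" holder=""
for st in "$snap"/*/stack; do
    pid=$(basename "$(dirname "$st")")
    # Blocked on i_mutex in the open/truncate path: the likely OOM
    # victim, which can never exit and free memory.
    if grep -q do_truncate "$st"; then
        victim=$pid
    # Entered the charge path while holding i_mutex from the write
    # path: waits for the victim's memory to be freed.
    elif grep -q mem_cgroup_handle_oom "$st" && \
         grep -q generic_file_aio_write "$st"; then
        holder=$pid
    fi
done
echo "$holder holds i_mutex and waits for memory; $victim waits for i_mutex"
rm -rf "$snap"
```

On the sample data this pairs pid 24796 (writer holding i_mutex, stuck in the OOM path) with pid 24495 (victim stuck in do_truncate), which is exactly the cycle suspected above.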
* Re: memory-cgroup bug
  2012-11-25 12:05 ` Michal Hocko
@ 2012-11-25 12:36   ` azurIt
  1 sibling, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-25 12:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

>So there are a lot of allocation attempts failing, every second!

Yes, as I said, the cgroup was taking 100% of its (allocated) CPU core(s).
Not sure if all processes were using CPU, but a _few_ of them (not only
one) for sure.

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-25 12:05 ` Michal Hocko
@ 2012-11-25 13:55   ` Michal Hocko
  1 sibling, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-25 13:55 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Sun 25-11-12 13:05:24, Michal Hocko wrote:
> [Adding Kamezawa into CC]
> 
> On Sun 25-11-12 01:10:47, azurIt wrote:
> > >Could you take a few snapshots over time?
> > 
> > 
> > Here it is, now from a different server, a snapshot was taken every second
> > for 10 minutes (hope it's enough):
> > www.watchdog.sk/lkml/memcg-bug-2.tar.gz
> 
> Hmm, interesting:
> $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
> min:16281 max:224048 avg:18818.943119
> 
> So there are a lot of allocation attempts failing, every second!
> Will get to that later.
> 
> The number of tasks in the group is stable (20):
> $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
>     546 20
> 
> And no task has been killed or spawned:
> $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
> 24495
> 24762
> 24774
> 24796
> 24798
> 24805
> 24813
> 24827
> 24831
> 24841
> 24842
> 24863
> 24892
> 24924
> 24931
> 25130
> 25131
> 25192
> 25193
> 25243
> 
> $ for stack in [0-9]*/[0-9]*
> do
> 	head -n1 $stack/stack
> done | sort | uniq -c
>    9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
>     546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
>     533 [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Tells us that the stacks are pretty much stable.
> $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
>     546 24495
> 
> So 24495 is stuck in do_truncate
> [<ffffffff811109b8>] do_truncate+0x58/0xa0
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> I suspect it is waiting for i_mutex. Who is holding that lock?
> The other tasks are blocked in mem_cgroup_handle_oom, coming either from
> the page fault path (so i_mutex can be excluded) or from vfs_write (24796),
> and that one is interesting:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0	# takes &inode->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> This smells like a deadlock. But kind of a strange one. The rapidly
> increasing failcnt suggests that somebody still tries to allocate - but
> who, when all of them are hung in mem_cgroup_handle_oom? This can be
> explained though.
> The memcg OOM killer lets only one process (the one which is able to lock
> the hierarchy by mem_cgroup_oom_lock) call mem_cgroup_out_of_memory and
> kill a process, while the others are waiting on the wait queue. Once the
> killer is done it calls memcg_wakeup_oom which wakes up the other tasks
> waiting on the queue. Those retry the charge, hoping some memory has been
> freed in the meantime, which hasn't happened, so they get into OOM again
> (and again and again).
> This all usually works out except in this particular case I would bet
> my hat that the OOM-selected task is pid 24495 which is blocked on the
> mutex held by one of the tasks in the OOM killer path, so it cannot
> finish - and thus free memory.
> 
> It seems that the current Linus' tree is affected as well.
> 
> I will have to think about a solution but it sounds really tricky. It is
> not just ext3 that is affected.
> 
> I guess we need to tell mem_cgroup_cache_charge that it should never
> reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
> me. On the other hand it is really weird that an excessive writer might
> trigger a memcg OOM killer.

This is hackish but it should help you in this case. Kamezawa, what do
you think about that? Should we generalize this and prepare something
like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
automatically and use the function whenever we are in a locked context?
To be honest I do not like this very much but nothing more sensible
(without touching non-memcg paths) comes to my mind.
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..da50c83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+					(gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-25 13:55 ` Michal Hocko
@ 2012-11-26  0:38   ` azurIt
  1 sibling, 0 replies; 444+ messages in thread
From: azurIt @ 2012-11-26 0:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

>This is hackish but it should help you in this case. Kamezawa, what do
>you think about that? Should we generalize this and prepare something
>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
>automatically and use the function whenever we are in a locked context?
>To be honest I do not like this very much but nothing more sensible
>(without touching non-memcg paths) comes to my mind.


I installed a kernel with this patch, will report back if the problem
occurs again OR in a few weeks if everything is ok. Thank you!

Btw, will this patch be backported to 3.2?

azur

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: memory-cgroup bug
  2012-11-26  0:38 ` azurIt
@ 2012-11-26  7:57   ` Michal Hocko
  0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-26 7:57 UTC (permalink / raw)
  To: azurIt; +Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
> 
> 
> I installed a kernel with this patch, will report back if the problem
> occurs again OR in a few weeks if everything is ok. Thank you!

Thanks!

> Btw, will this patch be backported to 3.2?

Once we agree on a proper solution it will be backported to the stable
trees.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
* [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked
  2012-11-26  0:38 ` azurIt
@ 2012-11-26 13:18   ` Michal Hocko
  1 sibling, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2012-11-26 13:18 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	Johannes Weiner

[CCing also Johannes - the thread started here:
https://lkml.org/lkml/2012/11/21/497]

On Mon 26-11-12 01:38:55, azurIt wrote:
> >This is hackish but it should help you in this case. Kamezawa, what do
> >you think about that? Should we generalize this and prepare something
> >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> >automatically and use the function whenever we are in a locked context?
> >To be honest I do not like this very much but nothing more sensible
> >(without touching non-memcg paths) comes to my mind.
> 
> 
> I installed a kernel with this patch, will report back if the problem
> occurs again OR in a few weeks if everything is ok. Thank you!

Now that I am looking at the patch closer, it will not work, because it
depends on another patch which is not merged yet - and even that one would
not help on its own, because __GFP_NORETRY doesn't break the charge loop.
Sorry, I have missed that...

The patch below should help though. (It is based on top of the current
-mm tree, but I will send a backport to 3.2 in the reply as well.)
---
From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because that task is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or the swap limit is hit as well and all cache
pages have been reclaimed already) and the process selected by the memcg
OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).

Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock though, because the administrator can still
intervene and increase the limit on the group, which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom
helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask,
which then tells mem_cgroup_charge_common that OOM is not allowed for
the charge. No OOM from this path, apart from fixing the bug, also makes
some sense, as we really do not want to cause an OOM because of page
cache usage.
As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM, but this is to be expected if the limit is set, and it is
preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg-specific flag, because it has
been used to prevent OOM already (since the not-merged-yet "memcg:
reclaim when more than one page needed"). The only difference is that
the flag doesn't prevent reclaim anymore, which kind of makes sense,
because the global memory allocator triggers reclaim as well. The retry
without any reclaim on __GFP_NORETRY didn't make much sense anyway,
because it was effectively a busy loop with OOM allowed in this path.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/gfp.h        |  3 +++
 include/linux/memcontrol.h | 12 ++++++++++++
 mm/filemap.c               |  8 +++++++-
 mm/memcontrol.c            |  5 +----
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 10e667f..aac9b21 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -152,6 +152,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..1ad4bc6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+		struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
@@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+		struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef14351 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge_no_oom(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..b4754ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
-	if (gfp_mask & __GFP_NORETRY)
-		return CHARGE_NOMEM;
-
 	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
@@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 444+ messages in thread
* [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked @ 2012-11-26 13:18 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-26 13:18 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner [CCing also Johannes - the thread started here: https://lkml.org/lkml/2012/11/21/497] On Mon 26-11-12 01:38:55, azurIt wrote: > >This is hackish but it should help you in this case. Kamezawa, what do > >you think about that? Should we generalize this and prepare something > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >automatically and use the function whenever we are in a locked context? > >To be honest I do not like this very much but nothing more sensible > >(without touching non-memcg paths) comes to my mind. > > > I installed kernel with this patch, will report back if problem occurs > again OR in few weeks if everything will be ok. Thank you! Now that I am looking at the patch closer it will not work because it depends on other patch which is not merged yet and even that one would help on its own because __GFP_NORETRY doesn't break the charge loop. Sorry I have missed that... The patch bellow should help though. (it is based on top of the current -mm tree but I will send a backport to 3.2 in the reply as well) --- From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. 
there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though because administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_ro_page_cache_locked). mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. No OOM from this path, except for fixing the bug, also make some sense as we really do not want to cause an OOM because of a page cache usage. 
As a possibly visible result, add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set, and it is preferable to the OOM killer IMO. __GFP_NORETRY is reused for this memcg-specific flag because it has already been used to prevent OOM (since the not-merged-yet "memcg: reclaim when more than one page needed"). The only difference is that the flag no longer prevents reclaim, which makes sense because the global memory allocator triggers reclaim as well. Retrying without any reclaim on __GFP_NORETRY didn't make much sense anyway, because it was effectively a busy loop with OOM allowed in this path. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 12 ++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 5 +---- 4 files changed, 23 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 10e667f..aac9b21 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -152,6 +152,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..1ad4bc6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec 
*mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) { diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef14351 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..b4754ba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (!(gfp_mask & __GFP_WAIT)) return CHARGE_WOULDBLOCK; - if (gfp_mask & __GFP_NORETRY) - return CHARGE_NOMEM; - ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) return CHARGE_RETRY; @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:18 ` Michal Hocko (?) @ 2012-11-26 13:21 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-26 13:21 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Here we go with the patch for 3.2.34. Could you test with this one, please? --- From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked The memcg OOM killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because that task is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap, or the swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it). 
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock, though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. Forbidding OOM from this path, apart from fixing the bug, also makes some sense, as we really do not want to cause an OOM because of page cache usage. As a possibly visible result, add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set, and it is preferable to the OOM killer IMO. 
__GFP_NORETRY is reused for this memcg-specific flag because no user-accounted allocation uses this flag, except for THP, which has memcg OOM disabled already. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 13 +++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 2 +- 4 files changed, 24 insertions(+), 2 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git a/mm/filemap.c 
b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..1dbbe7f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2703,7 +2703,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:21 ` Michal Hocko @ 2012-11-26 21:28 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-26 21:28 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, regarding your conversation with Johannes Weiner, should i try this patch or not? azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:21 ` Michal Hocko (?) @ 2012-11-30 1:45 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 1:45 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go with the patch for 3.2.34. Could you test with this one, >please? I installed the kernel with this patch and will report back if the problem occurs again, OR in a few weeks if everything is OK. Thank you! azurIt ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:21 ` Michal Hocko @ 2012-11-30 2:29 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 2:29 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go with the patch for 3.2.34. Could you test with this one, >please? Michal, unfortunately i had to boot to another kernel because the one with this patch keeps killing my MySQL server :( it was probably doing it on OOM in any cgroup - it looks like the OOM killer was not choosing processes only from the cgroup which is out of memory. Here is the log from syslog: http://www.watchdog.sk/lkml/oom_mysqld Maybe i should mention that the MySQL server has its own cgroup (called 'mysql') but with no limits on any resources. azurIt ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 2:29 ` azurIt (?) @ 2012-11-30 12:45 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-30 12:45 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 03:29:18, azurIt wrote: > >Here we go with the patch for 3.2.34. Could you test with this one, > >please? > > > Michal, unfortunately i had to boot to another kernel because the one > with this patch keeps killing my MySQL server :( it was, probably, > doing it on OOM in any cgroup - looks like OOM was not choosing > processes only from cgroup which is out of memory. Here is the log > from syslog: http://www.watchdog.sk/lkml/oom_mysqld You are seeing also global OOM: Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [<ffffffff810cc90e>] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70 Nov 30 02:53:56 server01 kernel: [ 818.233721] [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0 Nov 30 02:53:56 server01 kernel: [ 818.233842] [<ffffffff810cd485>] out_of_memory+0xe5/0x200 Nov 30 02:53:56 server01 kernel: [ 818.233963] [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50 Nov 30 02:53:56 server01 kernel: [ 818.234082] [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110 Nov 30 02:53:56 server01 kernel: [ 818.234204] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0 Nov 30 02:53:56 server01 kernel: [ 818.235886] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460 Nov 30 02:53:56 server01 kernel: [ 818.236006] [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0 Nov 30 02:53:56 server01 kernel: [ 818.236124] [<ffffffff810f35d7>] ? 
do_brk+0x267/0x400 Nov 30 02:53:56 server01 kernel: [ 818.236244] [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.236367] [<ffffffff815b547f>] page_fault+0x1f/0x30 [...] Nov 30 02:53:56 server01 kernel: [ 818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child Nov 30 02:53:56 server01 kernel: [ 818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB Then you also have the memcg oom killer: Nov 30 02:53:56 server01 kernel: [ 818.375717] Task in /1037/uid killed as a result of limit of /1037 Nov 30 02:53:56 server01 kernel: [ 818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736 Nov 30 02:53:56 server01 kernel: [ 818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0 The messages are intermixed and I guess rate limiting jumped in as well, because I cannot associate all the oom messages with a specific OOM event. Anyway, your system is under both global and local memory pressure. You didn't see apache going down previously because it was probably the one which was stuck and could be killed. Anyway, you need to set up your system more carefully. > Maybe i should mention that MySQL server has it's own cgroup (called > 'mysql') but with no limits to any resources. Where is that group in the hierarchy? > > azurIt > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 12:45 ` Michal Hocko (?) @ 2012-11-30 12:53 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 12:53 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. No, it wasn't, i'm 1000% sure (i was on SSH). Here is the memory usage graph from that system on that time: http://www.watchdog.sk/lkml/memory.png The blank part is rebooting into new kernel. MySQL server was killed several times, then i rebooted into previous kernel and problem was gone (not a single MySQL kill). You can see two MySQL kills there on 03:54 and 03:04:30. > >> Maybe i should mention that MySQL server has it's own cgroup (called >> 'mysql') but with no limits to any resources. > >Where is that group in the hierarchy? In root. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 12:45 ` Michal Hocko @ 2012-11-30 13:44 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 13:44 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Anyway your system is under both global and local memory pressure. You >didn't see apache going down previously because it was probably the one >which was stuck and could be killed. >Anyway you need to setup your system more carefully. There is also evidence that the system has enough memory! :) Just take the 'rss' column from the process list in the OOM message and sum it - you will get 2489911. It's probably in KB so it's about 2.4 GB. The system has 14 GB of RAM, so this also matches the data on my graph - 2.4 is about 17% of 14. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
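[Editor's note] azurIt's back-of-the-envelope sum over the 'rss' column can be scripted. This is a hedged sketch: the sample rows and the column index are illustrative only, so verify them against the actual dump's header line. Note also that many kernels print total_vm and rss in the OOM task dump in pages rather than kB, which would change the conversion to GB.

```python
def sum_rss(task_dump_lines, rss_col=4):
    """Sum the rss column of an OOM task dump.

    The 3.2-era column layout is roughly:
    [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
    (rss_col=4 is an assumption; check it against your kernel's header line).
    """
    total = 0
    for line in task_dump_lines:
        # drop the brackets around the pid so split() yields clean columns
        fields = line.replace("[", " ").replace("]", " ").split()
        total += int(fields[rss_col])
    return total

# Hypothetical sample rows in the 3.2 format (not taken from the actual log)
dump = [
    "[ 2188]     0  2188   832504  218062   3       0             0 mysqld",
    "[ 9247]  1037  9247    55837    1204   1       0             0 apache2",
]
print(sum_rss(dump))  # 219266
```

If the figures are 4 kB pages, multiply the total by 4 to get kB before comparing against the zone sizes.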
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 13:44 ` azurIt (?) @ 2012-11-30 14:44 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-30 14:44 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 14:44:27, azurIt wrote: > >Anyway your system is under both global and local memory pressure. You > >didn't see apache going down previously because it was probably the one > >which was stuck and could be killed. > >Anyway you need to setup your system more carefully. > > > There is, also, an evidence that system has enough of memory! :) Just > take column 'rss' from process list in OOM message and sum it - you > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > 14. Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone is hardly touched: Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no The DMA32 zone usually fills up the first 4G unless your HW remaps the rest of the memory above 4G, or you have a NUMA machine and the rest of the memory is at another node. Could you post your memory map printed during the boot?
(e820: BIOS-provided physical RAM map: and following lines) There is also ZONE_NORMAL, which is not used much either: Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no You have mentioned that you are comounting with cpuset. If this happens to be a NUMA machine, have you made access to all nodes available? Also, what does /proc/sys/vm/zone_reclaim_mode say? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
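[Editor's note] The zone lines quoted above encode the watermarks the page allocator compares free memory against (kswapd is woken when free drops below 'low' and direct reclaim kicks in around 'min'). A small hedged helper, written here just for illustration, to extract the numbers and check whether a zone is actually under pressure:

```python
import re

def zone_watermarks(zone_line):
    """Extract kB stats from a show_mem zone line and compare free to the watermarks."""
    stats = {key: int(val) for key, val in re.findall(r"(\w+):(\d+)kB", zone_line)}
    free = stats["free"]
    return {
        "free": free,
        "min": stats["min"],
        "low": stats["low"],
        "high": stats["high"],
        # below 'low' the zone is under reclaim pressure
        "under_pressure": free < stats["low"],
    }

# Truncated copy of the DMA32 line from the report above
dma32 = ("DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB "
         "active_anon:0kB inactive_anon:0kB present:2542248kB")
print(zone_watermarks(dma32)["under_pressure"])  # False
```

With 2.5 GB free against a 3.3 MB 'low' watermark, DMA32 clearly was not the constrained zone, which is Michal's point.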
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 14:44 ` Michal Hocko (?) @ 2012-11-30 15:03 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-30 15:03 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 15:44:31, Michal Hocko wrote: > On Fri 30-11-12 14:44:27, azurIt wrote: > > >Anyway your system is under both global and local memory pressure. You > > >didn't see apache going down previously because it was probably the one > > >which was stuck and could be killed. > > >Anyway you need to setup your system more carefully. > > > > > > There is, also, an evidence that system has enough of memory! :) Just > > take column 'rss' from process list in OOM message and sum it - you > > will get 2489911. It's probably in KB so it's about 2.4 GB. System has > > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of > > 14. > > Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone > is hardly touched: > Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > DMA32 zone is usually fills up first 4G unless your HW remaps the rest > of the memory above 4G or you have a numa machine and the rest of the > memory is at other node. Could you post your memory map printed during > the boot? 
(e820: BIOS-provided physical RAM map: and following lines) > > There is also ZONE_NORMAL which is also not used much > Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > > You have mentioned that you are comounting with cpuset. If this happens > to be a NUMA machine have you made the access to all nodes available? And now that I am looking at the oom message more closely I can see Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Nov 30 02:53:56 server01 kernel: [ 818.233029] apache2 cpuset=uid mems_allowed=0 Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1 Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace: Nov 30 02:53:56 server01 kernel: [ 818.233470] [<ffffffff810cc90e>] dump_header+0x7e/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.233600] [<ffffffff810cc80f>] ? find_lock_task_mm+0x2f/0x70 Nov 30 02:53:56 server01 kernel: [ 818.233721] [<ffffffff810ccdd5>] oom_kill_process+0x85/0x2a0 Nov 30 02:53:56 server01 kernel: [ 818.233842] [<ffffffff810cd485>] out_of_memory+0xe5/0x200 Nov 30 02:53:56 server01 kernel: [ 818.233963] [<ffffffff8102aa8f>] ? 
pte_alloc_one+0x3f/0x50 Nov 30 02:53:56 server01 kernel: [ 818.234082] [<ffffffff810cd65d>] pagefault_out_of_memory+0xbd/0x110 Nov 30 02:53:56 server01 kernel: [ 818.234204] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0 Nov 30 02:53:56 server01 kernel: [ 818.235886] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460 Nov 30 02:53:56 server01 kernel: [ 818.236006] [<ffffffff810f3057>] ? vma_merge+0x1f7/0x2c0 Nov 30 02:53:56 server01 kernel: [ 818.236124] [<ffffffff810f35d7>] ? do_brk+0x267/0x400 Nov 30 02:53:56 server01 kernel: [ 818.236244] [<ffffffff812c9a92>] ? gr_learn_resource+0x42/0x1e0 Nov 30 02:53:56 server01 kernel: [ 818.236367] [<ffffffff815b547f>] page_fault+0x1f/0x30 Which is interesting from two perspectives. Only the first node (Node-0) is allowed, which would suggest that the cpuset controller is not configured for all nodes. It is still surprising that Node 0 wouldn't have any memory (I would expect ZONE_DMA32 to be sitting there). Anyway, the more interesting thing is that gfp_mask indicates a GFP_NOWAIT allocation from the page fault? Huh, this shouldn't happen - ever. > Also what does /proc/sys/vm/zone_reclaim_mode says? > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
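[Editor's note] The `mems_allowed=0` field in the OOM header is a cpuset nodelist, so "0" means "node 0 only", not "no nodes". A tiny helper, written here just for illustration, that expands such a nodelist string the same way the kernel's list format is usually read:

```python
def parse_nodelist(s):
    """Expand a kernel nodelist string like '0-2,4' into a set of node IDs."""
    nodes = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            nodes.update(range(lo, hi + 1))
        else:
            nodes.add(int(part))
    return nodes

print(parse_nodelist("0"))      # the oom header's mems_allowed=0: node 0 only
print(parse_nodelist("0-2,4"))  # a wider example: nodes 0, 1, 2 and 4
```

On a single-node machine `mems_allowed=0` is the expected value; it only signals a misconfigured cpuset if the machine actually has more nodes.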
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 15:03 ` Michal Hocko @ 2012-11-30 15:37 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-30 15:37 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 16:03:47, Michal Hocko wrote: [...] > Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation > from the page fault? Huh this shouldn't happen - ever. OK, it starts making sense now. The message came from pagefault_out_of_memory which doesn't have the gfp nor the required node information any longer. This suggests that VM_FAULT_OOM has been returned by the fault handler. So this hasn't been triggered by the page fault allocator. I am wondering whether this could be caused by the patch but the effect of that one should be limited to the write path (unlike the later version for the -mm tree which hooks into shmem as well). Will have to think about it some more. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 14:44 ` Michal Hocko @ 2012-11-30 15:08 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 15:08 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >DMA32 zone is usually fills up first 4G unless your HW remaps the rest >of the memory above 4G or you have a numa machine and the rest of the >memory is at other node. Could you post your memory map printed during >the boot? (e820: BIOS-provided physical RAM map: and following lines) Here is the full boot log: www.watchdog.sk/lkml/kern.log >You have mentioned that you are comounting with cpuset. If this happens >to be a NUMA machine have you made the access to all nodes available? >Also what does /proc/sys/vm/zone_reclaim_mode says? Don't really know what NUMA means and which nodes are you talking about, sorry :( # cat /proc/sys/vm/zone_reclaim_mode cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 15:08 ` azurIt @ 2012-11-30 15:39 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-30 15:39 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 16:08:11, azurIt wrote: > >DMA32 zone is usually fills up first 4G unless your HW remaps the rest > >of the memory above 4G or you have a numa machine and the rest of the > >memory is at other node. Could you post your memory map printed during > >the boot? (e820: BIOS-provided physical RAM map: and following lines) > > > Here is the full boot log: > www.watchdog.sk/lkml/kern.log The log is not complete. Could you paste the complete dmesg output? Or even better, do you have logs from the previous run? > >You have mentioned that you are comounting with cpuset. If this happens > >to be a NUMA machine have you made the access to all nodes available? > >Also what does /proc/sys/vm/zone_reclaim_mode says? > > > Don't really know what NUMA means and which nodes are you talking > about, sorry :( http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access > # cat /proc/sys/vm/zone_reclaim_mode > cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory OK, so the NUMA is not enabled. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 15:39 ` Michal Hocko (?) @ 2012-11-30 15:59 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 15:59 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >> Here is the full boot log: >> www.watchdog.sk/lkml/kern.log > >The log is not complete. Could you paste the complete dmesg output? Or >even better, do you have logs from the previous run? What is missing there? All kernel messages are logged into /var/log/kern.log (it's the same as dmesg); the dmesg buffer itself was already overwritten by other messages. I think that's all the kernel printed. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 15:59 ` azurIt @ 2012-11-30 16:19 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-30 16:19 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 16:59:37, azurIt wrote: > >> Here is the full boot log: > >> www.watchdog.sk/lkml/kern.log > > > >The log is not complete. Could you paste the complete dmesg output? Or > >even better, do you have logs from the previous run? > > > What is missing there? All kernel messages are logged into > /var/log/kern.log (it's the same as dmesg); the dmesg buffer itself was > already overwritten by other messages. I think that's all the kernel printed. Early boot messages are missing - so exactly the BIOS memory map I was asking for. As NUMA has been excluded it is probably not that relevant anymore. The important question is why you see VM_FAULT_OOM and whether a memcg charging failure can trigger that. I do not see how this could happen right now because __GFP_NORETRY is not used for user pages (except for THP, which disables memcg OOM already), and file-backed page faults (aka __do_fault) use mem_cgroup_newpage_charge, which doesn't disable OOM. This is a real head scratcher. Could you also post your complete containers configuration, maybe there is something strange in there (basically grep . -r YOUR_CGROUP_MNT except for tasks files which are of no use right now). -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
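Michal's argument — that a plain GFP_KERNEL charge from the fault path should never hand -ENOMEM to its caller — can be sketched as a toy model. The flag values and the try_charge shape below are illustrative stand-ins, not the real 3.2 memcg code:

```c
#include <errno.h>

/* Simplified gfp bits -- illustrative values, not the kernel's. */
#define __GFP_WAIT    0x10u
#define __GFP_NORETRY 0x1000u
#define GFP_KERNEL    __GFP_WAIT	/* real GFP_KERNEL also sets IO/FS bits */
#define GFP_NOWAIT    0x0u

/* Toy model of the reasoning: a memcg charge may only return -ENOMEM
 * to its caller when the mask forbids sleeping (no __GFP_WAIT) or
 * forbids retrying (__GFP_NORETRY).  A plain GFP_KERNEL charge, as
 * used by the page-fault paths, keeps OOM handling enabled and either
 * reclaims, kills, or blocks -- it is not supposed to fail. */
static int try_charge(unsigned int gfp_mask, int over_limit)
{
	if (!over_limit)
		return 0;
	if (!(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_NORETRY))
		return -ENOMEM;	/* may surface as VM_FAULT_OOM */
	return 0;		/* memcg OOM handling resolved the shortage */
}
```

Under this model only the GFP_NOWAIT/__GFP_NORETRY callers can see -ENOMEM, which is exactly why a VM_FAULT_OOM from an ordinary page fault is a head scratcher.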
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 16:19 ` Michal Hocko @ 2012-11-30 16:26 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 16:26 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Could you also post your complete containers configuration, maybe there >is something strange in there (basically grep . -r YOUR_CGROUP_MNT >except for tasks files which are of no use right now). Here it is: http://www.watchdog.sk/lkml/cgroups.gz ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 16:26 ` azurIt (?) @ 2012-11-30 16:53 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-30 16:53 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 17:26:51, azurIt wrote: > >Could you also post your complete containers configuration, maybe there > >is something strange in there (basically grep . -r YOUR_CGROUP_MNT > >except for tasks files which are of no use right now). > > > Here it is: > http://www.watchdog.sk/lkml/cgroups.gz The only strange thing I noticed is that some groups have 0 limit. Is this intentional? grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c 3 memory.limit_in_bytes:0 254 memory.limit_in_bytes:104857600 107 memory.limit_in_bytes:157286400 68 memory.limit_in_bytes:209715200 10 memory.limit_in_bytes:262144000 28 memory.limit_in_bytes:314572800 1 memory.limit_in_bytes:346030080 1 memory.limit_in_bytes:524288000 2 memory.limit_in_bytes:9223372036854775807 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 16:53 ` Michal Hocko @ 2012-11-30 20:43 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-30 20:43 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >The only strange thing I noticed is that some groups have 0 limit. Is >this intentional? >grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c > 3 memory.limit_in_bytes:0 These are users who are not allowed to run anything. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-30 16:19 ` Michal Hocko @ 2012-12-03 15:16 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-03 15:16 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 30-11-12 17:19:23, Michal Hocko wrote: [...] > The important question is why you see VM_FAULT_OOM and whether memcg > charging failure can trigger that. I don not see how this could happen > right now because __GFP_NORETRY is not used for user pages (except for > THP which disable memcg OOM already), file backed page faults (aka > __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM. > This is a real head scratcher. The following should print the traces when we hand over ENOMEM to the caller. It should catch all charge paths (migration is not covered but that one is not important here). If we don't see any traces from here and there is still global OOM striking then there must be something else to trigger this. Could you test this with the patch which aims at fixing your deadlock, please? I realise that this is a production environment but I do not see anything relevant in the code. --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..9e5b56b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,7 @@ done: return 0; nomem: *ptr = NULL; + __WARN(); return -ENOMEM; bypass: *ptr = NULL; -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
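The debugging idea in the patch above — drop a __WARN() on the exact error-return label so the backtrace identifies who receives -ENOMEM — can be mimicked in userspace. WARN_HERE and charge_page below are hypothetical stand-ins for the kernel's __WARN() and __mem_cgroup_try_charge(), not their real implementations:

```c
#include <stdio.h>
#include <errno.h>

/* Userspace analogue of the kernel's __WARN(): report file/line (the
 * kernel additionally dumps a full backtrace) so the log pinpoints
 * which return path handed -ENOMEM to the caller. */
#define WARN_HERE() \
	fprintf(stderr, "WARNING: at %s:%d %s()\n", __FILE__, __LINE__, __func__)

/* Hypothetical charge function shaped like the instrumented code: the
 * interesting question is not *whether* it fails but *through which*
 * exit label, which is what instrumenting the nomem: path answers. */
static int charge_page(int over_limit, int oom_allowed)
{
	if (!over_limit)
		return 0;	/* done: charge fit under the limit */
	if (oom_allowed)
		return 0;	/* OOM killer frees memory, charge retried */
	WARN_HERE();		/* nomem: this is the leak being hunted */
	return -ENOMEM;
}
```

The trick is that the warning fires at the moment of the decision, while the gfp and cgroup context still exist — unlike the later gfp_mask=0x0 OOM report.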
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-03 15:16 ` Michal Hocko (?) @ 2012-12-05 1:36 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-05 1:36 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >The following should print the traces when we hand over ENOMEM to the >caller. It should catch all charge paths (migration is not covered but >that one is not important here). If we don't see any traces from here >and there is still global OOM striking then there must be something else >to trigger this. >Could you test this with the patch which aims at fixing your deadlock, >please? I realise that this is a production environment but I do not see >anything relevant in the code. Michal, i think/hope this is what you wanted: http://www.watchdog.sk/lkml/oom_mysqld2 ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-05 1:36 ` azurIt (?) @ 2012-12-05 14:17 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-05 14:17 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 05-12-12 02:36:44, azurIt wrote: > >The following should print the traces when we hand over ENOMEM to the > >caller. It should catch all charge paths (migration is not covered but > >that one is not important here). If we don't see any traces from here > >and there is still global OOM striking then there must be something else > >to trigger this. > >Could you test this with the patch which aims at fixing your deadlock, > >please? I realise that this is a production environment but I do not see > >anything relevant in the code. > > > Michal, > > i think/hope this is what you wanted: > http://www.watchdog.sk/lkml/oom_mysqld2 Dec 5 02:20:48 server01 kernel: [ 380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0() Dec 5 02:20:48 server01 kernel: [ 380.995950] Hardware name: S5000VSA Dec 5 02:20:48 server01 kernel: [ 380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1 Dec 5 02:20:48 server01 kernel: [ 380.995954] Call Trace: Dec 5 02:20:48 server01 kernel: [ 380.995960] [<ffffffff81054eaa>] warn_slowpath_common+0x7a/0xb0 Dec 5 02:20:48 server01 kernel: [ 380.995963] [<ffffffff81054efa>] warn_slowpath_null+0x1a/0x20 Dec 5 02:20:48 server01 kernel: [ 380.995965] [<ffffffff8110b2e1>] T.1146+0x2c1/0x5d0 Dec 5 02:20:48 server01 kernel: [ 380.995967] [<ffffffff8110ba83>] mem_cgroup_charge_common+0x53/0x90 Dec 5 02:20:48 server01 kernel: [ 380.995970] [<ffffffff8110bb05>] mem_cgroup_newpage_charge+0x45/0x50 Dec 5 02:20:48 server01 kernel: [ 380.995974] [<ffffffff810eddf9>] handle_pte_fault+0x609/0x940 Dec 5 02:20:48 server01 kernel: [ 380.995978] [<ffffffff8102aa8f>] ? 
pte_alloc_one+0x3f/0x50 Dec 5 02:20:48 server01 kernel: [ 380.995981] [<ffffffff810ee268>] handle_mm_fault+0x138/0x260 Dec 5 02:20:48 server01 kernel: [ 380.995983] [<ffffffff810270ed>] do_page_fault+0x13d/0x460 Dec 5 02:20:48 server01 kernel: [ 380.995986] [<ffffffff810f429c>] ? do_mmap_pgoff+0x3dc/0x430 Dec 5 02:20:48 server01 kernel: [ 380.995988] [<ffffffff810f197d>] ? remove_vma+0x5d/0x80 Dec 5 02:20:48 server01 kernel: [ 380.995992] [<ffffffff815b54ff>] page_fault+0x1f/0x30 Dec 5 02:20:48 server01 kernel: [ 380.995994] ---[ end trace 25bbb3e634c25b7f ]--- Dec 5 02:20:48 server01 kernel: [ 380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 Dec 5 02:20:48 server01 kernel: [ 380.996377] apache2 cpuset=uid mems_allowed=0 Dec 5 02:20:48 server01 kernel: [ 380.996379] Pid: 5351, comm: apache2 Tainted: G W 3.2.34-grsec #1 Dec 5 02:20:48 server01 kernel: [ 380.996380] Call Trace: Dec 5 02:20:48 server01 kernel: [ 380.996384] [<ffffffff810cc91e>] dump_header+0x7e/0x1e0 Dec 5 02:20:48 server01 kernel: [ 380.996387] [<ffffffff810cc81f>] ? find_lock_task_mm+0x2f/0x70 Dec 5 02:20:48 server01 kernel: [ 380.996389] [<ffffffff810ccde5>] oom_kill_process+0x85/0x2a0 Dec 5 02:20:48 server01 kernel: [ 380.996392] [<ffffffff810cd495>] out_of_memory+0xe5/0x200 Dec 5 02:20:48 server01 kernel: [ 380.996394] [<ffffffff8102aa8f>] ? pte_alloc_one+0x3f/0x50 Dec 5 02:20:48 server01 kernel: [ 380.996397] [<ffffffff810cd66d>] pagefault_out_of_memory+0xbd/0x110 Dec 5 02:20:48 server01 kernel: [ 380.996399] [<ffffffff81026ec6>] mm_fault_error+0xb6/0x1a0 Dec 5 02:20:48 server01 kernel: [ 380.996401] [<ffffffff8102739e>] do_page_fault+0x3ee/0x460 Dec 5 02:20:48 server01 kernel: [ 380.996403] [<ffffffff810f429c>] ? do_mmap_pgoff+0x3dc/0x430 Dec 5 02:20:48 server01 kernel: [ 380.996405] [<ffffffff810f197d>] ? 
remove_vma+0x5d/0x80 Dec 5 02:20:48 server01 kernel: [ 380.996408] [<ffffffff815b54ff>] page_fault+0x1f/0x30 OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. This can only happen if this was an atomic allocation request (!__GFP_WAIT) or if oom is not allowed, which is the case only for transparent huge page allocation. The first case can be excluded (in the clean 3.2 stable kernel) because all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one should be OK because the page fault should fall back to a regular page if the THP allocation/charge fails. [/me goes to double check] Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with VM_FAULT_OOM without any fallback. We should fall back to do_huge_pmd_wp_page_fallback instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The patch applies to 3.2 without any further modifications. I didn't have time to test it but if it helps you we should push this to the stable tree. --- >From 765f5e0121c4410faa19c088e9ada75976bde178 Mon Sep 17 00:00:00 2001 From: David Rientjes <rientjes@google.com> Date: Tue, 29 May 2012 15:06:23 -0700 Subject: [PATCH] thp, memcg: split hugepage for memcg oom on cow On COW, a new hugepage is allocated and charged to the memcg. If the system is oom or the charge to the memcg fails, however, the fault handler will return VM_FAULT_OOM which results in an oom kill. Instead, it's possible to fall back to splitting the hugepage so that the COW results only in an order-0 page being allocated and charged to the memcg, which has a higher likelihood to succeed. This is expensive because the hugepage must be split in the page fault handler, but it is much better than unnecessarily oom killing a process. 
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 1f1d06c34f7675026326cd9f39ff91e4555cf355) --- mm/huge_memory.c | 3 +++ mm/memory.c | 18 +++++++++++++++--- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8f005e9..470cbb4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -921,6 +921,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, count_vm_event(THP_FAULT_FALLBACK); ret = do_huge_pmd_wp_page_fallback(mm, vma, address, pmd, orig_pmd, page, haddr); + if (ret & VM_FAULT_OOM) + split_huge_page(page); put_page(page); goto out; } @@ -928,6 +930,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) { put_page(new_page); + split_huge_page(page); put_page(page); ret |= VM_FAULT_OOM; goto out; diff --git a/mm/memory.c b/mm/memory.c index 70f5daf..15e686a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3469,6 +3469,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); +retry: pgd = pgd_offset(mm, address); pud = pud_alloc(mm, pgd, address); if (!pud) @@ -3482,13 +3483,24 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd, flags); } else { pmd_t orig_pmd = *pmd; + int ret; + barrier(); if (pmd_trans_huge(orig_pmd)) { if (flags & FAULT_FLAG_WRITE && !pmd_write(orig_pmd) && - !pmd_trans_splitting(orig_pmd)) - return do_huge_pmd_wp_page(mm, vma, address, - pmd, orig_pmd); + !pmd_trans_splitting(orig_pmd)) { + ret = do_huge_pmd_wp_page(mm, vma, 
address, pmd, + orig_pmd); + /* + * If COW results in an oom, the huge pmd will + * have been split, so retry the fault on the + * pte for a smaller charge. + */ + if (unlikely(ret & VM_FAULT_OOM)) + goto retry; + return ret; + } return 0; } } -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-05 14:17 ` Michal Hocko (?) @ 2012-12-06 0:29 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-06 0:29 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. >This can only happen if this was an atomic allocation request >(!__GFP_WAIT) or if oom is not allowed which is the case only for >transparent huge page allocation. >The first case can be excluded (in the clean 3.2 stable kernel) because >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one >should be OK because the page fault should fallback to a regular page if >THP allocation/charge fails. >[/me goes to double check] >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The >patch applies to 3.2 without any further modifications. I didn't have >time to test it but if it helps you we should push this to the stable >tree. This, unfortunately, didn't fix the problem :( http://www.watchdog.sk/lkml/oom_mysqld3 ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-06 0:29 ` azurIt @ 2012-12-06 9:54 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-06 9:54 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Thu 06-12-12 01:29:24, azurIt wrote:
> >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
> >This can only happen if this was an atomic allocation request
> >(!__GFP_WAIT) or if oom is not allowed which is the case only for
> >transparent huge page allocation.
> >The first case can be excluded (in the clean 3.2 stable kernel) because
> >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The later one
> >should be OK because the page fault should fallback to a regular page if
> >THP allocation/charge fails.
> >[/me goes to double check]
> >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with
> >VM_FAULT_OOM without any fallback. We should do_huge_pmd_wp_page_fallback
> >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
> >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
> >patch applies to 3.2 without any further modifications. I didn't have
> >time to test it but if it helps you we should push this to the stable
> >tree.
>
> This, unfortunately, didn't fix the problem :(
> http://www.watchdog.sk/lkml/oom_mysqld3

Dohh. The very same stack: mem_cgroup_newpage_charge called from the
page fault. The heavy inlining is not particularly helping here... So
there must be some other THP charge leaking out.
[/me is diving into the code again]

* do_huge_pmd_anonymous_page falls back to handle_pte_fault
* do_huge_pmd_wp_page_fallback falls back to simple pages, so it doesn't
  charge the huge page
* do_huge_pmd_wp_page splits the huge page and retries with fallback to
  handle_pte_fault
* collapse_huge_page is not called in the page fault path
* do_wp_page, do_anonymous_page and __do_fault operate on a single page,
  so the memcg charging cannot return ENOMEM

There are no other callers AFAICS, so I am getting clueless. Maybe more
debugging will tell us something (the inlining has been reduced for the
THP paths, which can reduce performance in THP page-fault-heavy
workloads, but this will give us better traces - I hope).

Anyway, do you see the same problem if transparent huge pages are
disabled?
(echo never > /sys/kernel/mm/transparent_hugepage/enabled)
---
From 93a30140b50d8474a047b91c698f4880149635db Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 6 Dec 2012 10:40:17 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |  6 +++---
 mm/memcontrol.c  |  2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
 					pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
 	goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e5b56b..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,7 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
-	__WARN();
+	__WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-06 9:54 ` Michal Hocko @ 2012-12-06 10:12 ` azurIt 0 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-06 10:12 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>Dohh. The very same stack mem_cgroup_newpage_charge called from the page
>fault. The heavy inlining is not particularly helping here... So there
>must be some other THP charge leaking out.
>[/me is diving into the code again]
>
>* do_huge_pmd_anonymous_page falls back to handle_pte_fault
>* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
> charge the huge page
>* do_huge_pmd_wp_page splits the huge page and retries with fallback to
> handle_pte_fault
>* collapse_huge_page is not called in the page fault path
>* do_wp_page, do_anonymous_page and __do_fault operate on a single page
> so the memcg charging cannot return ENOMEM
>
>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).

Should I apply all the patches together? (the fix for this bug, the extra
log messages, the backported fix from 3.5, and this new one)

>Anyway do you see the same problem if transparent huge pages are
>disabled?
>echo never > /sys/kernel/mm/transparent_hugepage/enabled)

# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-06 10:12 ` azurIt @ 2012-12-06 17:06 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-06 17:06 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Thu 06-12-12 11:12:49, azurIt wrote: > >Dohh. The very same stack mem_cgroup_newpage_charge called from the page > >fault. The heavy inlining is not particularly helping here... So there > >must be some other THP charge leaking out. > >[/me is diving into the code again] > > > >* do_huge_pmd_anonymous_page falls back to handle_pte_fault > >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't > > charge the huge page > >* do_huge_pmd_wp_page splits the huge page and retries with fallback to > > handle_pte_fault > >* collapse_huge_page is not called in the page fault path > >* do_wp_page, do_anonymous_page and __do_fault operate on a single page > > so the memcg charging cannot return ENOMEM > > > >There are no other callers AFAICS so I am getting clueless. Maybe more > >debugging will tell us something (the inlining has been reduced for thp > >paths which can reduce performance in thp page fault heavy workloads but > >this will give us better traces - I hope). > > > Should i apply all patches togather? (fix for this bug, more log > messages, backported fix from 3.5 and this new one) Yes please -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-06 9:54 ` Michal Hocko (?) @ 2012-12-10 1:20 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-10 1:20 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >There are no other callers AFAICS so I am getting clueless. Maybe more >debugging will tell us something (the inlining has been reduced for thp >paths which can reduce performance in thp page fault heavy workloads but >this will give us better traces - I hope). Michal, this was printing so many debug messages to console that the whole server hangs and i had to hard reset it after several minutes :( Sorry but i cannot test such a things in production. There's no problem with one soft reset which takes 4 minutes but this hard reset creates about 20 minutes outage (mainly cos of disk quotas checking). Last logged message: Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-10 1:20 ` azurIt (?) @ 2012-12-10 9:43 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-10 9:43 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 10-12-12 02:20:38, azurIt wrote: [...] > Michal, Hi, > this was printing so many debug messages to console that the whole > server hangs Hmm, this is _really_ surprising. The latest patch didn't add any new logging actually. It just enahanced messages which were already printed out previously + changed few functions to be not inlined so they show up in the traces. So the only explanation is that the workload has changed or the patches got misapplied. > and i had to hard reset it after several minutes :( Sorry > but i cannot test such a things in production. There's no problem with > one soft reset which takes 4 minutes but this hard reset creates about > 20 minutes outage (mainly cos of disk quotas checking). Understood. > Last logged message: > > Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 This explains why you have seen your machine hung. I am not familiar with grsec but stalling each fork 30s sounds really bad. Anyway this will not help me much. Do you happen to still have any of those logged traces from the last run? Apart from that. If my current understanding is correct then this is related to transparent huge pages (and leaking charge to the page fault handler). Do you see the same problem if you disable THP before you start your workload? 
(echo never > /sys/kernel/mm/transparent_hugepage/enabled) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-10 9:43 ` Michal Hocko (?) @ 2012-12-10 10:18 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-10 10:18 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Hmm, this is _really_ surprising. The latest patch didn't add any new >logging actually. It just enahanced messages which were already printed >out previously + changed few functions to be not inlined so they show up >in the traces. So the only explanation is that the workload has changed >or the patches got misapplied. This time i installed 3.2.35, maybe some changes between .34 and .35 did this? Should i try .34? >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > >This explains why you have seen your machine hung. I am not familiar >with grsec but stalling each fork 30s sounds really bad. Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. >Anyway this will not help me much. Do you happen to still have any of >those logged traces from the last run? Unfortunately not, it didn't log anything and tons of messages were printed only to console (i was logged via IP-KVM). It looked that printing is infinite, i rebooted it after few minutes. >Apart from that. If my current understanding is correct then this is >related to transparent huge pages (and leaking charge to the page fault >handler). Do you see the same problem if you disable THP before you >start your workload? 
(echo never > /sys/kernel/mm/transparent_hugepage/enabled) # cat /sys/kernel/mm/transparent_hugepage/enabled cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory # ls -la /sys/kernel/mm total 0 drwx------ 3 root root 0 Dec 10 11:11 . drwx------ 5 root root 0 Dec 10 02:06 .. drwx------ 2 root root 0 Dec 10 11:11 cleancache ^ permalink raw reply [flat|nested] 444+ messages in thread
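[Editorial note: the missing sysfs file above is itself informative — /sys/kernel/mm/transparent_hugepage/ only exists when the kernel was built with CONFIG_TRANSPARENT_HUGEPAGE. A minimal sketch of that check follows; the helper name and the optional path parameter are illustrative additions, not anything from the thread:]

```shell
# thp_state: print the current THP mode, or note that THP is compiled out.
# Takes an optional path argument so the logic can be exercised on any file.
thp_state() {
    path="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    if [ ! -e "$path" ]; then
        # The file is absent when the kernel was built without
        # CONFIG_TRANSPARENT_HUGEPAGE -- which is what the listing of
        # /sys/kernel/mm above shows for this 3.2 config.
        echo "THP not compiled in"
    else
        # When present, the bracketed word is the active mode,
        # e.g. "always [madvise] never".
        cat "$path"
    fi
}
```

When the file does exist, disabling THP is the quoted `echo never > /sys/kernel/mm/transparent_hugepage/enabled`, run as root; note that `sudo echo never > file` does not work, because the redirection is performed by the unprivileged shell.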
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-10 10:18 ` azurIt (?) @ 2012-12-10 15:52 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-10 15:52 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 10-12-12 11:18:17, azurIt wrote: > >Hmm, this is _really_ surprising. The latest patch didn't add any new > >logging actually. It just enahanced messages which were already printed > >out previously + changed few functions to be not inlined so they show up > >in the traces. So the only explanation is that the workload has changed > >or the patches got misapplied. > > > This time i installed 3.2.35, maybe some changes between .34 and .35 > did this? Should i try .34? I would try to limit changes to minimum. So the original kernel you were using + the first patch to prevent OOM from the write path + 2 debugging patches. > >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0 > > > >This explains why you have seen your machine hung. I am not familiar > >with grsec but stalling each fork 30s sounds really bad. > > > Btw, i never ever saw such a message from grsecurity yet. Will write to grsec mailing list about explanation. > > > >Anyway this will not help me much. Do you happen to still have any of > >those logged traces from the last run? > > > Unfortunately not, it didn't log anything and tons of messages were > printed only to console (i was logged via IP-KVM). It looked that > printing is infinite, i rebooted it after few minutes. 
But was it at least related to the debugging from the patch or it was rather a totally unrelated thing? > >Apart from that. If my current understanding is correct then this is > >related to transparent huge pages (and leaking charge to the page fault > >handler). Do you see the same problem if you disable THP before you > >start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled) > > # cat /sys/kernel/mm/transparent_hugepage/enabled > cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory Weee. Then it cannot be related to THP at all. Which makes this even bigger mystery. We really need to find out who is leaking that charge. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-10 15:52 ` Michal Hocko @ 2012-12-10 17:18 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-10 17:18 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. ok. >But was it at least related to the debugging from the patch or it was >rather a totally unrelated thing? I wasn't reading it much but i think it looks like a traces i was sending you before. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-10 15:52 ` Michal Hocko @ 2012-12-17 1:34 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-17 1:34 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >I would try to limit changes to minimum. So the original kernel you were >using + the first patch to prevent OOM from the write path + 2 debugging >patches. It didn't take off the whole system this time (but i was prepared to record a video of console ;) ), here it is: http://www.watchdog.sk/lkml/oom_mysqld4 ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-17 1:34 ` azurIt @ 2012-12-17 16:32 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-17 16:32 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 17-12-12 02:34:30, azurIt wrote: > >I would try to limit changes to minimum. So the original kernel you were > >using + the first patch to prevent OOM from the write path + 2 debugging > >patches. > > > It didn't take off the whole system this time (but i was > prepared to record a video of console ;) ), here it is: > http://www.watchdog.sk/lkml/oom_mysqld4 [...] [ 1248.059429] ------------[ cut here ]------------ [ 1248.059586] WARNING: at mm/memcontrol.c:2400 T.1146+0x2d9/0x610() [ 1248.059723] Hardware name: S5000VSA [ 1248.059855] gfp_mask:208 nr_pages:1 oom:0 ret:2 This is GFP_KERNEL allocation which is expected. It is also a simple page which is not that expected because we shouldn't return ENOMEM on those unless this was GFP_ATOMIC allocation (which it wasn't) or the caller told us to not trigger OOM which is the case only for THP pages (see mem_cgroup_charge_common). So the big question is how have we ended up with oom=false here... [Ohh, I am really an idiot. I screwed the first patch] - bool oom = true; + bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM). No idea how I could have missed that. I am really sorry about that. 
--- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c04676d..1f35a74 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-17 16:32 ` Michal Hocko @ 2012-12-17 18:23 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-17 18:23 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >[Ohh, I am really an idiot. I screwed the first patch] >- bool oom = true; >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > No idea how I could have missed that. I am really sorry about that. :D no problem :) so, now it should really work as expected and completely fix my original problem? is it safe to apply it on 3.2.35? Thank you very much! azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-17 18:23 ` azurIt (?) @ 2012-12-17 19:55 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-17 19:55 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 17-12-12 19:23:01, azurIt wrote: > >[Ohh, I am really an idiot. I screwed the first patch] > >- bool oom = true; > >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM); > > > >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM). > > No idea how I could have missed that. I am really sorry about that. > > > :D no problem :) so, now it should really work as expected and > completely fix my original problem? It should mitigate the problem. The real fix shouldn't be that specific (as per discussion in other thread). The chance this will get upstream is not big and that means that it will not get to the stable tree either. > is it safe to apply it on 3.2.35? I didn't check what are the differences but I do not think there is anything to conflict with it. > Thank you very much! HTH -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-17 19:55 ` Michal Hocko (?) @ 2012-12-18 14:22 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-18 14:22 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >It should mitigate the problem. The real fix shouldn't be that specific >(as per discussion in other thread). The chance this will get upstream >is not big and that means that it will not get to the stable tree >either. OOM is no longer killing processes outside target cgroups, so everything looks fine so far. Will report back when i will have more info. Thnks! azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-18 14:22 ` azurIt @ 2012-12-18 15:20 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-18 15:20 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 18-12-12 15:22:23, azurIt wrote: > >It should mitigate the problem. The real fix shouldn't be that specific > >(as per discussion in other thread). The chance this will get upstream > >is not big and that means that it will not get to the stable tree > >either. > > > OOM is no longer killing processes outside target cgroups, so > everything looks fine so far. Will report back when i will have more > info. Thnks! OK, good to hear and fingers crossed. I will try to get back to the original problem and a better solution sometimes early next year when all the things settle a bit. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-18 15:20 ` Michal Hocko (?) @ 2012-12-24 13:25 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-24 13:25 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometimes early next year when >all the things settle a bit. Michal, problem, unfortunately, happened again :( twice. When it happened first time (two days ago) i don't want to believe it so i recompiled the kernel and boot it again to be sure i really used your patch. Today it happened again, here is report: http://watchdog.sk/lkml/memcg-bug-3.tar.gz Here is patch which i used (kernel 3.2.35, i didn't use any other from your patches): http://watchdog.sk/lkml/5-memcg-fix.patch azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-24 13:25 ` azurIt (?) @ 2012-12-28 16:22 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-28 16:22 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 24-12-12 14:25:26, azurIt wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometimes early next year when > >all the things settle a bit. > > > Michal, problem, unfortunately, happened again :( twice. When it > happened first time (two days ago) i don't want to believe it so i > recompiled the kernel and boot it again to be sure i really used your > patch. Today it happened again, here is report: > http://watchdog.sk/lkml/memcg-bug-3.tar.gz Hmm, 1356352982/1507/stack says [<ffffffff8110a971>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b55b>] T.1147+0x5ab/0x5c0 [<ffffffff8110c1de>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca20f>] add_to_page_cache_locked+0x4f/0x140 [<ffffffff810ca322>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810cac53>] find_or_create_page+0x73/0xb0 [<ffffffff8114340a>] __getblk+0xea/0x2c0 [<ffffffff811921ab>] ext3_getblk+0xeb/0x240 [<ffffffff81192319>] ext3_bread+0x19/0x90 [<ffffffff811967e3>] ext3_dx_find_entry+0x83/0x1e0 [<ffffffff81196c24>] ext3_find_entry+0x2e4/0x480 [<ffffffff8119750d>] ext3_lookup+0x4d/0x120 [<ffffffff8111cff5>] d_alloc_and_lookup+0x45/0x90 [<ffffffff8111d598>] do_lookup+0x278/0x390 [<ffffffff8111f11e>] path_lookupat+0xae/0x7e0 [<ffffffff8111f885>] do_path_lookup+0x35/0xe0 [<ffffffff8111fa19>] user_path_at_empty+0x59/0xb0 [<ffffffff8111fa81>] user_path_at+0x11/0x20 [<ffffffff811164d7>] vfs_fstatat+0x47/0x80 [<ffffffff8111657e>] vfs_lstat+0x1e/0x20 [<ffffffff811165a4>] sys_newlstat+0x24/0x50 [<ffffffff815b5a66>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff which suggests 
that the patch is incomplete and that I am blind :/ mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following follow-up patch on top of the one you already have (which should catch all the remaining cases). Sorry about that... --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 89997ac..559a54d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg; int ret; @@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-28 16:22 ` Michal Hocko (?) @ 2012-12-30 1:09 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-30 1:09 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >which suggests that the patch is incomplete and that I am blind :/ >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >follow-up patch on top of the one you already have (which should catch >all the remaining cases). >Sorry about that... This was, again, killing my MySQL server (search for "(mysqld)"): http://www.watchdog.sk/lkml/oom_mysqld5 ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-30 1:09 ` azurIt @ 2012-12-30 11:08 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-30 11:08 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Sun 30-12-12 02:09:47, azurIt wrote:
> >which suggests that the patch is incomplete and that I am blind :/
> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
> >follow-up patch on top of the one you already have (which should catch
> >all the remaining cases).
> >Sorry about that...
>
> This was, again, killing my MySQL server (search for "(mysqld)"):
> http://www.watchdog.sk/lkml/oom_mysqld5

grep "Kill process" oom_mysqld5
Dec 30 01:53:34 server01 kernel: [  367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
Dec 30 01:53:35 server01 kernel: [  367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
Dec 30 01:53:35 server01 kernel: [  367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child

So your mysqld has been killed by the global OOM killer, not memcg. But why, when you seem to be perfectly fine regarding memory? I guess the following backtrace is relevant:
Dec 30 01:53:36 server01 kernel: [  368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
Dec 30 01:53:36 server01 kernel: [  368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
Dec 30 01:53:36 server01 kernel: [  368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB
Dec 30 01:53:36 server01 kernel: [  368.571906] 308964 total pagecache pages
Dec 30 01:53:36 server01 kernel: [  368.572023] 0 pages in swap cache
Dec 30 01:53:36 server01 kernel: [  368.572140] Swap cache stats: add 0, delete 0, find 0/0
Dec 30 01:53:36 server01 kernel: [  368.572260] Free swap  = 0kB
Dec 30 01:53:36 server01 kernel: [  368.572375] Total swap = 0kB
Dec 30 01:53:36 server01 kernel: [  368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec 30 01:53:36 server01 kernel: [  368.598034] apache2 cpuset=uid mems_allowed=0
Dec 30 01:53:36 server01 kernel: [  368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1
Dec 30 01:53:36 server01 kernel: [  368.598273] Call Trace:
Dec 30 01:53:36 server01 kernel: [  368.598396] [<ffffffff810cc89e>] dump_header+0x7e/0x1e0
Dec 30 01:53:36 server01 kernel: [  368.598516] [<ffffffff810cc79f>] ? find_lock_task_mm+0x2f/0x70
Dec 30 01:53:36 server01 kernel: [  368.598638] [<ffffffff810ccd65>] oom_kill_process+0x85/0x2a0
Dec 30 01:53:36 server01 kernel: [  368.598759] [<ffffffff810cd415>] out_of_memory+0xe5/0x200
Dec 30 01:53:36 server01 kernel: [  368.598880] [<ffffffff810cd5ed>] pagefault_out_of_memory+0xbd/0x110
Dec 30 01:53:36 server01 kernel: [  368.599006] [<ffffffff81026e96>] mm_fault_error+0xb6/0x1a0
Dec 30 01:53:36 server01 kernel: [  368.599127] [<ffffffff8102736e>] do_page_fault+0x3ee/0x460
Dec 30 01:53:36 server01 kernel: [  368.599250] [<ffffffff81131ccf>] ? mntput+0x1f/0x30
Dec 30 01:53:36 server01 kernel: [  368.599371] [<ffffffff811134e6>] ? fput+0x156/0x200
Dec 30 01:53:36 server01 kernel: [  368.599496] [<ffffffff815b567f>] page_fault+0x1f/0x30

This would suggest that an unexpected ENOMEM leaked during the page fault path. I do not see which one that could be, because you said THP (CONFIG_TRANSPARENT_HUGEPAGE) is disabled (and the other patch I have mentioned in the thread should fix that issue - btw. the patch is already scheduled for the stable tree). __do_fault, do_anonymous_page and do_wp_page call mem_cgroup_newpage_charge with GFP_KERNEL, which means that we do memcg OOM and never return ENOMEM. do_swap_page calls mem_cgroup_try_charge_swapin with GFP_KERNEL as well.

I might have missed something, but I will not get to look closer before 2nd January.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-30 11:08 ` Michal Hocko (?) @ 2013-01-25 15:07 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-01-25 15:07 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Any news? Thnx! azur ______________________________________________________________ > Od: "Michal Hocko" <mhocko@suse.cz> > Komu: azurIt <azurit@pobox.sk> > Dátum: 30.12.2012 12:08 > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org> >On Sun 30-12-12 02:09:47, azurIt wrote: >> >which suggests that the patch is incomplete and that I am blind :/ >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following >> >follow-up patch on top of the one you already have (which should catch >> >all the remaining cases). >> >Sorry about that... 
>> >> >> This was, again, killing my MySQL server (search for "(mysqld)"): >> http://www.watchdog.sk/lkml/oom_mysqld5 > >grep "Kill process" oom_mysqld5 >Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child >Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child >Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child >Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child >Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child > >So your mysqld has been killed by the global OOM not memcg. But why when >you seem to be perfectly fine regarding memory? 
I guess the following >backtrace is relevant: >Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB >Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB >Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB >Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages >Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache >Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 >Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB >Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 >Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 >Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: >Dec 30 01:53:36 server01 kernel: [ 368.598396] [<ffffffff810cc89e>] dump_header+0x7e/0x1e0 >Dec 30 01:53:36 server01 kernel: [ 368.598516] [<ffffffff810cc79f>] ? 
find_lock_task_mm+0x2f/0x70 >Dec 30 01:53:36 server01 kernel: [ 368.598638] [<ffffffff810ccd65>] oom_kill_process+0x85/0x2a0 >Dec 30 01:53:36 server01 kernel: [ 368.598759] [<ffffffff810cd415>] out_of_memory+0xe5/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.598880] [<ffffffff810cd5ed>] pagefault_out_of_memory+0xbd/0x110 >Dec 30 01:53:36 server01 kernel: [ 368.599006] [<ffffffff81026e96>] mm_fault_error+0xb6/0x1a0 >Dec 30 01:53:36 server01 kernel: [ 368.599127] [<ffffffff8102736e>] do_page_fault+0x3ee/0x460 >Dec 30 01:53:36 server01 kernel: [ 368.599250] [<ffffffff81131ccf>] ? mntput+0x1f/0x30 >Dec 30 01:53:36 server01 kernel: [ 368.599371] [<ffffffff811134e6>] ? fput+0x156/0x200 >Dec 30 01:53:36 server01 kernel: [ 368.599496] [<ffffffff815b567f>] page_fault+0x1f/0x30 > >This would suggest that an unexpected ENOMEM leaked during page fault >path. I do not see which one could that be because you said THP >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have >mentioned in the thread should fix that issue - btw. the patch is >already scheduled for stable tree). > __do_fault, do_anonymous_page and do_wp_page call >mem_cgroup_newpage_charge with GFP_KERNEL which means that >we do memcg OOM and never return ENOMEM. do_swap_page calls >mem_cgroup_try_charge_swapin with GFP_KERNEL as well. > >I might have missed something but I will not get to look closer before >2nd January. >-- >Michal Hocko >SUSE Labs > ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-01-25 15:07 ` azurIt (?) @ 2013-01-25 16:31 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-01-25 16:31 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 25-01-13 16:07:23, azurIt wrote: > Any news? Thnx! Sorry, but I didn't get to this one yet. > > azur > > > > ______________________________________________________________ > > Od: "Michal Hocko" <mhocko@suse.cz> > > Komu: azurIt <azurit@pobox.sk> > > Dátum: 30.12.2012 12:08 > > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked > > > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org> > >On Sun 30-12-12 02:09:47, azurIt wrote: > >> >which suggests that the patch is incomplete and that I am blind :/ > >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache > >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following > >> >follow-up patch on top of the one you already have (which should catch > >> >all the remaining cases). > >> >Sorry about that... 
> >> > >> > >> This was, again, killing my MySQL server (search for "(mysqld)"): > >> http://www.watchdog.sk/lkml/oom_mysqld5 > > > >grep "Kill process" oom_mysqld5 > >Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child > >Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child > >Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child > >Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child > >Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child > >Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child > >Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child > > > >So your mysqld has been killed by the global OOM not memcg. But why when > >you seem to be perfectly fine regarding memory? 
I guess the following > >backtrace is relevant: > >Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB > >Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB > >Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB > >Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages > >Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache > >Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0 > >Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB > >Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB > >Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > >Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0 > >Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1 > >Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace: > >Dec 30 01:53:36 server01 kernel: [ 368.598396] [<ffffffff810cc89e>] dump_header+0x7e/0x1e0 > >Dec 30 01:53:36 server01 kernel: [ 368.598516] [<ffffffff810cc79f>] ? 
find_lock_task_mm+0x2f/0x70 > >Dec 30 01:53:36 server01 kernel: [ 368.598638] [<ffffffff810ccd65>] oom_kill_process+0x85/0x2a0 > >Dec 30 01:53:36 server01 kernel: [ 368.598759] [<ffffffff810cd415>] out_of_memory+0xe5/0x200 > >Dec 30 01:53:36 server01 kernel: [ 368.598880] [<ffffffff810cd5ed>] pagefault_out_of_memory+0xbd/0x110 > >Dec 30 01:53:36 server01 kernel: [ 368.599006] [<ffffffff81026e96>] mm_fault_error+0xb6/0x1a0 > >Dec 30 01:53:36 server01 kernel: [ 368.599127] [<ffffffff8102736e>] do_page_fault+0x3ee/0x460 > >Dec 30 01:53:36 server01 kernel: [ 368.599250] [<ffffffff81131ccf>] ? mntput+0x1f/0x30 > >Dec 30 01:53:36 server01 kernel: [ 368.599371] [<ffffffff811134e6>] ? fput+0x156/0x200 > >Dec 30 01:53:36 server01 kernel: [ 368.599496] [<ffffffff815b567f>] page_fault+0x1f/0x30 > > > >This would suggest that an unexpected ENOMEM leaked during page fault > >path. I do not see which one could that be because you said THP > >(CONFIG_TRANSPARENT_HUGEPAGE) are disabled (and the other patch I have > >mentioned in the thread should fix that issue - btw. the patch is > >already scheduled for stable tree). > > __do_fault, do_anonymous_page and do_wp_page call > >mem_cgroup_newpage_charge with GFP_KERNEL which means that > >we do memcg OOM and never return ENOMEM. do_swap_page calls > >mem_cgroup_try_charge_swapin with GFP_KERNEL as well. > > > >I might have missed something but I will not get to look closer before > >2nd January. > >-- > >Michal Hocko > >SUSE Labs > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-01-25 16:31 ` Michal Hocko (?) @ 2013-02-05 13:49 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-05 13:49 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 25-01-13 17:31:30, Michal Hocko wrote: > On Fri 25-01-13 16:07:23, azurIt wrote: > > Any news? Thnx! > > Sorry, but I didn't get to this one yet. Sorry, to get back to this that late but I was busy as hell since the beginning of the year. Has the issue repeated since then? You said you didn't apply other than the above mentioned patch. Could you apply also debugging part of the patches I have sent? In case you don't have it handy then it should be this one: --- >From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 3 Dec 2012 16:16:01 +0100 Subject: [PATCH] more debugging --- mm/huge_memory.c | 6 +++--- mm/memcontrol.c | 1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 470cbb4..01a11f1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag) } #endif -int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags) { @@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm) return pgtable; } -static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, +static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd, @@ -883,7 +883,7 @@ out_free_pages: goto out; } -int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int 
do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd) { int ret = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..1986c65 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,7 @@ done: return 0; nomem: *ptr = NULL; + __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret); return -ENOMEM; bypass: *ptr = NULL; -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 13:49 ` Michal Hocko @ 2013-02-05 14:49 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-05 14:49 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>Sorry, to get back to this that late but I was busy as hell since the
>beginning of the year.

Thank you for your time!

>Has the issue repeated since then?

Yes, it's happening all the time, but meanwhile I wrote a script which monitors the problem and kills frozen processes when it occurs. I don't like it much, though; it's not a solution for me :( I also noticed that the problem always affects the whole server, not just the frozen cgroup. Depending on the number of frozen processes, sometimes it has almost no impact on the rest of the server, sometimes the whole server lags badly.

I have another old problem which is maybe also related to this. I wasn't connecting it with this before, but now I'm not sure. Two of our servers, which are affected by this cgroup problem, also randomly freeze completely (a few times per month). These are the symptoms:
- servers answer to ping
- it is possible to connect via SSH, but the connection freezes after sending the password
- it is possible to log in via console, but it freezes after typing the login

These symptoms are very similar to HDD problems or HDD overload (but there is no overload for sure). The only way to fix it, probably, is hard rebooting the server (I didn't find any other way). What do you think? Can this be related? Maybe the HDDs get locked in a similar way to the cgroups - we already found out that the cgroup freezing is also related to HDD activity. Maybe there is a little chance that the whole HDD subsystem ends up in a deadlock?

>You said you didn't apply other than the above mentioned patch. Could
>you apply also debugging part of the patches I have sent?
>In case you don't have it handy then it should be this one: Just to be sure - am I supposed to apply these two patches? http://watchdog.sk/lkml/patches/ azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 14:49 ` azurIt (?) @ 2013-02-05 16:09 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-05 16:09 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 05-02-13 15:49:47, azurIt wrote: [...]
> Just to be sure - am I supposed to apply these two patches?
> http://watchdog.sk/lkml/patches/

5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I mentioned in a follow-up email. Here is the full patch:
---
From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

The memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating, because that task is blocked on the very same lock. This can happen when a write system call needs to allocate a page, but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap, or the swap limit is hit as well and all cache pages have been reclaimed already), and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
Process A
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

Process B
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

This is not a hard deadlock, though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. Forbidding OOM from this path, apart from fixing the bug, also makes some sense, as we really do not want to cause an OOM because of page cache usage. As a possibly visible result, add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set, and it is preferable to the OOM killer IMO.
__GFP_NORETRY is abused for this memcg specific flag because no user accounted allocation use this flag except for THP which have memcg oom disabled already. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 13 +++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 10 ++++++---- 4 files changed, 29 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git 
a/mm/filemap.c b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1986c65..a68aa08 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { @@ -2771,6 +2771,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2783,7 +2784,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2819,6 +2820,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg; int ret; @@ -2841,13 +2843,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, 
mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked @ 2013-02-05 16:09 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-05 16:09 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 05-02-13 15:49:47, azurIt wrote: [...] > Just to be sure - am i supposed to apply this two patches? > http://watchdog.sk/lkml/patches/ 5-memcg-fix-1.patch is not complete. It doesn't contain the followup I mentioned in a follow-up email. Here is the full patch: --- From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked The memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other tasks from terminating because they are blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap, or the swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it). 
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock, though, because the administrator can still intervene and increase the limit on the group, which lets the writer finish the allocation and release the lock. This patch fixes the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. Forbidding OOM from this path, beyond fixing the bug, also makes some sense as we really do not want to cause an OOM because of page cache usage. As a possibly visible result, add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set, and it is preferable to the OOM killer IMO. 
__GFP_NORETRY is abused for this memcg specific flag because no user accounted allocation use this flag except for THP which have memcg oom disabled already. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/gfp.h | 3 +++ include/linux/memcontrol.h | 13 +++++++++++++ mm/filemap.c | 8 +++++++- mm/memcontrol.c | 10 ++++++---- 4 files changed, 29 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git 
a/mm/filemap.c b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge_no_oom(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1986c65..a68aa08 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); int ret; if (PageTransHuge(page)) { @@ -2771,6 +2771,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2783,7 +2784,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2819,6 +2820,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(mask & GFP_MEMCG_NO_OOM); struct mem_cgroup *memcg; int ret; @@ -2841,13 +2843,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, 
mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:09 ` Michal Hocko @ 2013-02-05 16:46 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-05 16:46 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >mentioned in a follow up email. oh, it wasn't complete? i used it in my last test.. sorry, i'm a little confused by all those patches. will try it tonight and report back. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:09 ` Michal Hocko @ 2013-02-05 16:48 ` Greg Thelen -1 siblings, 0 replies; 444+ messages in thread From: Greg Thelen @ 2013-02-05 16:48 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 15:49:47, azurIt wrote: > [...] >> Just to be sure - am i supposed to apply this two patches? >> http://watchdog.sk/lkml/patches/ > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > mentioned in a follow up email. Here is the full patch: > --- > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > Process B > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff It looks like grab_cache_page_write_begin() passes __GFP_FS into __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me think that this deadlock is also possible in the page allocator even before getting to add_to_page_cache_lru. no? Can callers holding fs resources (e.g. i_mutex) pass __GFP_FS into the page allocator? If __GFP_FS was avoided, then I think memcg user page charging would need a !__GFP_FS check to avoid invoking oom killer, but at least then we'd avoid both deadlocks and cover both page allocation and memcg page charging in similar fashion. Example from memcg_charge_kmem: may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY); ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:48 ` Greg Thelen @ 2013-02-05 17:46 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-05 17:46 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 05-02-13 08:48:23, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 15:49:47, azurIt wrote: > > [...] > >> Just to be sure - am i supposed to apply this two patches? > >> http://watchdog.sk/lkml/patches/ > > > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > mentioned in a follow up email. Here is the full patch: > > --- > > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.cz> > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > Process A > > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > > [<ffffffff81121c90>] do_last+0x250/0xa30 > > [<ffffffff81122547>] path_openat+0xd7/0x440 > > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > > [<ffffffff8110f950>] sys_open+0x20/0x30 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > Process B > > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > > [<ffffffff81112381>] sys_write+0x51/0x90 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > It looks like grab_cache_page_write_begin() passes __GFP_FS into > __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > think that this deadlock is also possible in the page allocator even > before getting to add_to_page_cache_lru. no? I am not that familiar with VFS but i_mutex is a high level lock AFAIR and it shouldn't be called from the pageout path so __page_cache_alloc should be safe. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 17:46 ` Michal Hocko @ 2013-02-05 18:09 ` Greg Thelen -1 siblings, 0 replies; 444+ messages in thread From: Greg Thelen @ 2013-02-05 18:09 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> > [...] >> >> Just to be sure - am i supposed to apply this two patches? >> >> http://watchdog.sk/lkml/patches/ >> > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> > mentioned in a follow up email. Here is the full patch: >> > --- >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> > From: Michal Hocko <mhocko@suse.cz> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> > >> > memcg oom killer might deadlock if the process which falls down to >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> > terminate because it is blocked on the very same lock. >> > This can happen when a write system call needs to allocate a page but >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> > have been reclaimed already) and the process selected by memcg OOM >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> > >> > Process A >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >> > [<ffffffff81121c90>] do_last+0x250/0xa30 >> > [<ffffffff81122547>] path_openat+0xd7/0x440 >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >> > [<ffffffff8110f950>] sys_open+0x20/0x30 >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> > [<ffffffffffffffff>] 0xffffffffffffffff >> > >> > Process B >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 >> > [<ffffffff81112381>] sys_write+0x51/0x90 >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> > [<ffffffffffffffff>] 0xffffffffffffffff >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> think that this deadlock is also possible in the page allocator even >> before getting to add_to_page_cache_lru. no? > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > and it shouldn't be called from the pageout path so __page_cache_alloc > should be safe. I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. My concern is that __page_cache_alloc() will invoke the oom killer and select a victim which wants i_mutex. This victim will deadlock because the oom killer caller already holds i_mutex. 
The wild accusation I am making is that anyone who invokes the oom killer and waits on the victim to die is essentially grabbing all of the locks that any of the oom killer victims may grab (e.g. i_mutex). To avoid deadlock, the oom killer can only be called while holding no locks that the oom victim demands. I think some locks are grabbed in a way that allows the lock request to fail if the task has a fatal signal pending, so they are safe. But any lock acquisitions that cannot fail (e.g. mutex_lock) will deadlock with the oom killing process. So the oom killing process cannot hold any such locks which the victim will attempt to grab. Hopefully I'm missing something. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked @ 2013-02-05 18:09 ` Greg Thelen 0 siblings, 0 replies; 444+ messages in thread From: Greg Thelen @ 2013-02-05 18:09 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> > [...] >> >> Just to be sure - am i supposed to apply this two patches? >> >> http://watchdog.sk/lkml/patches/ >> > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> > mentioned in a follow up email. Here is the full patch: >> > --- >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> > From: Michal Hocko <mhocko@suse.cz> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> > >> > memcg oom killer might deadlock if the process which falls down to >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> > terminate because it is blocked on the very same lock. >> > This can happen when a write system call needs to allocate a page but >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> > have been reclaimed already) and the process selected by memcg OOM >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> > >> > Process A >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >> > [<ffffffff81121c90>] do_last+0x250/0xa30 >> > [<ffffffff81122547>] path_openat+0xd7/0x440 >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >> > [<ffffffff8110f950>] sys_open+0x20/0x30 >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> > [<ffffffffffffffff>] 0xffffffffffffffff >> > >> > Process B >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 >> > [<ffffffff81112381>] sys_write+0x51/0x90 >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> > [<ffffffffffffffff>] 0xffffffffffffffff >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> think that this deadlock is also possible in the page allocator even >> before getting to add_to_page_cache_lru. no? > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > and it shouldn't be called from the pageout path so __page_cache_alloc > should be safe. I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. My concern is that __page_cache_alloc() will invoke the oom killer and select a victim which wants i_mutex. This victim will deadlock because the oom killer caller already holds i_mutex. 
This victim will deadlock because the oom killer caller already holds i_mutex. The wild accusation I am making is that anyone who invokes the oom killer and waits on the victim to die is essentially grabbing all of the locks that any of the oom killer victims may grab (e.g. i_mutex). To avoid deadlock the oom killer can only be called while holding no locks that the oom victim demands. I think some locks are grabbed in a way that allows the lock request to fail if the task has a fatal signal pending, so they are safe. But any lock acquisitions that cannot fail (e.g. mutex_lock) will deadlock with the oom killing process. So the oom killing process cannot hold any such locks which the victim will attempt to grab. Hopefully I'm missing something. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 18:09 ` Greg Thelen (?) @ 2013-02-05 18:59 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-05 18:59 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 05-02-13 10:09:57, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> > [...] > >> >> Just to be sure - am i supposed to apply this two patches? > >> >> http://watchdog.sk/lkml/patches/ > >> > > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> > mentioned in a follow up email. Here is the full patch: > >> > --- > >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> > From: Michal Hocko <mhocko@suse.cz> > >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> > > >> > memcg oom killer might deadlock if the process which falls down to > >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> > terminate because it is blocked on the very same lock. > >> > This can happen when a write system call needs to allocate a page but > >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> > have been reclaimed already) and the process selected by memcg OOM > >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> > > >> > Process A > >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > >> > [<ffffffff81121c90>] do_last+0x250/0xa30 > >> > [<ffffffff81122547>] path_openat+0xd7/0x440 > >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > >> > [<ffffffff8110f950>] sys_open+0x20/0x30 > >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> > > >> > Process B > >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > >> > [<ffffffff81112381>] sys_write+0x51/0x90 > >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> > >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> think that this deadlock is also possible in the page allocator even > >> before getting to add_to_page_cache_lru. no? > > > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > > and it shouldn't be called from the pageout path so __page_cache_alloc > > should be safe. > > I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > My concern is that __page_cache_alloc() will invoke the oom killer and > select a victim which wants i_mutex. 
This victim will deadlock because > the oom killer caller already holds i_mutex. That would be true for the memcg oom because that one is blocking but the global oom just puts the allocator to sleep for a while and then the allocator should back off eventually (unless this is a NOFAIL allocation). I would need to look closer whether this is really the case - I haven't seen that allocator code path for a while... > The wild accusation I am making is that anyone who invokes the oom > killer and waits on the victim to die is essentially grabbing all of > the locks that any of the oom killer victims may grab (e.g. i_mutex). True. > To avoid deadlock the oom killer can only be called while holding > no locks that the oom victim demands. I think some locks are grabbed > in a way that allows the lock request to fail if the task has a fatal > signal pending, so they are safe. But any lock acquisitions that > cannot fail (e.g. mutex_lock) will deadlock with the oom killing > process. So the oom killing process cannot hold any such locks which > the victim will attempt to grab. Hopefully I'm missing something. Agreed. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 18:59 ` Michal Hocko @ 2013-02-08 4:27 ` Greg Thelen -1 siblings, 0 replies; 444+ messages in thread From: Greg Thelen @ 2013-02-08 4:27 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 10:09:57, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> >> > [...] >> >> >> Just to be sure - am i supposed to apply this two patches? >> >> >> http://watchdog.sk/lkml/patches/ >> >> > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> >> > mentioned in a follow up email. Here is the full patch: >> >> > --- >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> >> > From: Michal Hocko <mhocko@suse.cz> >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> >> > >> >> > memcg oom killer might deadlock if the process which falls down to >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> >> > terminate because it is blocked on the very same lock. >> >> > This can happen when a write system call needs to allocate a page but >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> >> > have been reclaimed already) and the process selected by memcg OOM >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> >> > >> >> > Process A >> >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >> >> > [<ffffffff81121c90>] do_last+0x250/0xa30 >> >> > [<ffffffff81122547>] path_openat+0xd7/0x440 >> >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 >> >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >> >> > [<ffffffff8110f950>] sys_open+0x20/0x30 >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> >> > [<ffffffffffffffff>] 0xffffffffffffffff >> >> > >> >> > Process B >> >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >> >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >> >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >> >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >> >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >> >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >> >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >> >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >> >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >> >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >> >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 >> >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 >> >> > [<ffffffff81112381>] sys_write+0x51/0x90 >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >> >> > [<ffffffffffffffff>] 0xffffffffffffffff >> >> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> >> think that this deadlock is also possible in the page allocator even >> >> before getting to add_to_page_cache_lru. no? >> > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR >> > and it shouldn't be called from the pageout path so __page_cache_alloc >> > should be safe. >> >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. 
>> My concern is that __page_cache_alloc() will invoke the oom killer and >> select a victim which wants i_mutex. This victim will deadlock because >> the oom killer caller already holds i_mutex. > > That would be true for the memcg oom because that one is blocking but > the global oom just puts the allocator to sleep for a while and then > the allocator should back off eventually (unless this is a NOFAIL > allocation). I would need to look closer whether this is really the case > - I haven't seen that allocator code path for a while... I think the page allocator can loop forever waiting for an oom victim to terminate even without NOFAIL, especially if the oom victim wants a resource exclusively held by the allocating thread (e.g. i_mutex). It looks like the same deadlock you describe is also possible (though rarer) without memcg. If the looping thread is an eligible oom victim (i.e. not oom disabled, not a kernel thread, etc.) then the page allocator can return NULL so long as NOFAIL is not used. So any allocator which is able to call the oom killer and is not oom disabled (kernel thread, etc.) is already exposed to the possibility of page allocator failure. So if the page allocator could detect the deadlock, then it could safely return NULL. Maybe after looping N times without forward progress the page allocator should consider failing unless NOFAIL is given. Switching back to the memcg oom situation, can we similarly return NULL if memcg oom kill has been tried a reasonable number of times? Simply failing the memcg charge with ENOMEM seems easier to support than exceeding the limit (Kame's loan patch). ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-08 4:27 ` Greg Thelen (?) @ 2013-02-08 16:29 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-08 16:29 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Thu 07-02-13 20:27:00, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 10:09:57, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> >> > >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> >> > [...] > >> >> >> Just to be sure - am i supposed to apply this two patches? > >> >> >> http://watchdog.sk/lkml/patches/ > >> >> > > >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> >> > mentioned in a follow up email. Here is the full patch: > >> >> > --- > >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> >> > From: Michal Hocko <mhocko@suse.cz> > >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> >> > > >> >> > memcg oom killer might deadlock if the process which falls down to > >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> >> > terminate because it is blocked on the very same lock. > >> >> > This can happen when a write system call needs to allocate a page but > >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> >> > have been reclaimed already) and the process selected by memcg OOM > >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> >> > > >> >> > Process A > >> >> > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > >> >> > [<ffffffff81121c90>] do_last+0x250/0xa30 > >> >> > [<ffffffff81122547>] path_openat+0xd7/0x440 > >> >> > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > >> >> > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > >> >> > [<ffffffff8110f950>] sys_open+0x20/0x30 > >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> >> > > >> >> > Process B > >> >> > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > >> >> > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > >> >> > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > >> >> > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > >> >> > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > >> >> > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > >> >> > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > >> >> > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > >> >> > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > >> >> > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >> >> > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > >> >> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > >> >> > [<ffffffff81112381>] sys_write+0x51/0x90 > >> >> > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >> >> > [<ffffffffffffffff>] 0xffffffffffffffff > >> >> > >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> >> think that this deadlock is also possible in the page allocator even > >> >> before getting to add_to_page_cache_lru. no? > >> > > >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > >> > and it shouldn't be called from the pageout path so __page_cache_alloc > >> > should be safe. > >> > >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. 
> >> My concern is that __page_cache_alloc() will invoke the oom killer and > >> select a victim which wants i_mutex. This victim will deadlock because > >> the oom killer caller already holds i_mutex. > > > > That would be true for the memcg oom because that one is blocking but > > the global oom just puts the allocator to sleep for a while and then > > the allocator should back off eventually (unless this is a NOFAIL > > allocation). I would need to look closer whether this is really the case > > - I haven't seen that allocator code path for a while... > > I think the page allocator can loop forever waiting for an oom victim to > terminate even without NOFAIL, especially if the oom victim wants a > resource exclusively held by the allocating thread (e.g. i_mutex). It > looks like the same deadlock you describe is also possible (though > rarer) without memcg. OK, I have checked the allocator slow path and you are right: even GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. an OOM-killed task blocked on down_write(mmap_sem) while the page fault handler holds mmap_sem for reading and allocates a new page without making progress. Luckily there are memory reserves to which the allocator eventually falls back, so the allocation should be able to get some memory and release the lock. There is still a theoretical chance this would block, though. This sounds like a corner case, so I wouldn't care about it very much. > If the looping thread is an eligible oom victim (i.e. not oom disabled, > not a kernel thread, etc.) then the page allocator can return NULL so > long as NOFAIL is not used. So any allocator which is able to call the > oom killer and is not oom disabled (kernel thread, etc.) is already > exposed to the possibility of page allocator failure. So if the page > allocator could detect the deadlock, then it could safely return NULL.
> Maybe after looping N times without forward progress the page allocator > should consider failing unless NOFAIL is given. The page allocator is quite tricky to touch, and the chances of this deadlock are not that big. > if memcg oom kill has been tried a reasonable number of times. Simply > failing the memcg charge with ENOMEM seems easier to support than > exceeding limit (Kame's loan patch). We cannot do that in the page fault path because this would lead to a global oom killer. We would need to either retry the page fault or send SIGKILL to the faulting process, but I do not like this much as it could lead to DoS attacks. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-08 16:29 ` Michal Hocko @ 2013-02-08 16:40 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-08 16:40 UTC (permalink / raw) To: Greg Thelen Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 17:29:18, Michal Hocko wrote: [...] > OK, I have checked the allocator slow path and you are right even > GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g. > OOM killed task blocked on down_write(mmap_sem) while the page fault > handler holding mmap_sem for reading and allocating a new page without > any progress. And now that I think about it some more, it sounds like this shouldn't be possible, because the allocator would fail once it sees TIF_MEMDIE (the OOM killer kills all threads that share the same mm). There may be other locks that are dangerous, but I think the risk is pretty low. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 16:09 ` Michal Hocko @ 2013-02-06 1:17 ` azurIt 0 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-06 1:17 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >5-memcg-fix-1.patch is not complete. It doesn't contain the followup I >mentioned in a follow up email. Here is the full patch: Here is the log where the OOM killer, again, killed the MySQL server [search for "(mysqld)"]: http://www.watchdog.sk/lkml/oom_mysqld6 azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-06 1:17 ` azurIt (?) @ 2013-02-06 14:01 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-06 14:01 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 02:17:21, azurIt wrote: > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >mentioned in a follow up email. Here is the full patch: > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > http://www.watchdog.sk/lkml/oom_mysqld6 [...] WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() Hardware name: S5000VSA gfp_mask:4304 nr_pages:1 oom:0 ret:2 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 [<ffffffff810eab18>] __do_fault+0x78/0x5a0 [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 [<ffffffff810f2508>] ? vma_link+0x88/0xe0 [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 [<ffffffff8102709d>] do_page_fault+0x13d/0x460 [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 [<ffffffff815b61ff>] page_fault+0x1f/0x30 ---[ end trace 8817670349022007 ]--- apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 apache2 cpuset=uid mems_allowed=0 Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 Call Trace: [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 [<ffffffff810ccc2f>] ? 
find_lock_task_mm+0x2f/0x70 [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 [<ffffffff815b61ff>] page_fault+0x1f/0x30 The first trace comes from the debugging WARN and it clearly points to a file fault path. __do_fault pre-charges a page in case we need to do CoW (copy-on-write) for the returned page. This one falls back to the memcg OOM killer and never returns ENOMEM, as I have mentioned earlier. However, the fs fault handler (filemap_fault here) can fall back to page_cache_read if the readahead (do_sync_mmap_readahead) fails to get the page into the page cache. And we can see this happening in the first trace. page_cache_read then calls add_to_page_cache_lru and eventually gets to add_to_page_cache_locked, which calls mem_cgroup_cache_charge_no_oom, so we will get ENOMEM if an oom should happen. This ENOMEM gets to the fault handler and kaboom. So the fix is really much more complex than I thought. Although add_to_page_cache_locked sounded like a good place, it turned out not to be. We need something more clever, apparently. One way would be to stop misusing __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 bits for those flags in gfp_t so there should be some room there. Or we could do this as a per-task flag, the same as we do for NO_IO in the current -mm tree. The latter one seems easier wrt. the gfp_mask passing horror - e.g. __generic_file_aio_write doesn't pass flags and it can be called from unlocked contexts as well. I have to think about it some more. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-06 14:01 ` Michal Hocko (?) @ 2013-02-06 14:22 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-06 14:22 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 15:01:19, Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > >mentioned in a follow up email. Here is the full patch: > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. > However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. This ENOMEM gets to the fault handler and kaboom. > > So the fix is really much more complex than I thought. Although > add_to_page_cache_locked sounded like a good place it turned out to be > not in fact. > > We need something more clever appaerently. One way would be not misusing > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > bits for those flags in gfp_t so there should be some room there. > Or we could do this per task flag, same we do for NO_IO in the current > -mm tree. > The later one seems easier wrt. 
gfp_mask passing horror - e.g. > __generic_file_aio_write doesn't pass flags and it can be called from > unlocked contexts as well. Ouch, the PF_ flags space seems to be drained already because task_struct::flags is just unsigned int, so there is just one bit left. I am not sure this is the best use for it. This will be a real pain! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-06 14:22 ` Michal Hocko (?) @ 2013-02-06 16:00 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-06 16:00 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 15:22:19, Michal Hocko wrote: > On Wed 06-02-13 15:01:19, Michal Hocko wrote: > > On Wed 06-02-13 02:17:21, azurIt wrote: > > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > > >mentioned in a follow up email. Here is the full patch: > > > > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > > http://www.watchdog.sk/lkml/oom_mysqld6 > > > > [...] > > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > > Hardware name: S5000VSA > > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > ---[ end trace 8817670349022007 ]--- > > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > > apache2 cpuset=uid mems_allowed=0 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > > > The first trace comes from the debugging WARN and it clearly points to > > a file fault path. __do_fault pre-charges a page in case we need to > > do CoW (copy-on-write) for the returned page. This one falls back to > > memcg OOM and never returns ENOMEM as I have mentioned earlier. > > However, the fs fault handler (filemap_fault here) can fallback to > > page_cache_read if the readahead (do_sync_mmap_readahead) fails > > to get page to the page cache. And we can see this happening in > > the first trace. page_cache_read then calls add_to_page_cache_lru > > and eventually gets to add_to_page_cache_locked which calls > > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > > happen. This ENOMEM gets to the fault handler and kaboom. > > > > So the fix is really much more complex than I thought. Although > > add_to_page_cache_locked sounded like a good place it turned out to be > > not in fact. > > > > We need something more clever appaerently. One way would be not misusing > > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > > bits for those flags in gfp_t so there should be some room there. 
> > Or we could do this per task flag, same we do for NO_IO in the current > > -mm tree. > > The later one seems easier wrt. gfp_mask passing horror - e.g. > > __generic_file_aio_write doesn't pass flags and it can be called from > > unlocked contexts as well. > Ouch, PF_ flags space seem to be drained already because > task_struct::flags is just unsigned int so there is just one bit left. I > am not sure this is the best use for it. This will be a real pain! OK, so this is something that should help you without any risk of false OOMs. I do not believe that something like this would be accepted upstream because it is really heavy; we will need to come up with something more clever for upstream. I have also added a warning which will trigger when the charge fails. If you see too many of those messages then something bad is going on and the lack of OOM is causing userspace to loop without making any progress. So there you go - your personal patch ;) You can drop all other patches. Please note I have only compile-tested it, but it should be pretty trivial to check that it is correct. --- From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 6 Feb 2013 16:45:07 +0100 Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set The memcg OOM killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because that task is blocked on the very same lock. This can happen when a write system call needs to allocate a page, the allocation hits the memcg hard limit, there is nothing to reclaim (e.g. there is no swap, or the swap limit is hit as well and all cache pages have been reclaimed already), and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it). 
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from dangerous contexts. The memcg charging code has no way to find out whether it is called from a locked context, so we have to help it via process flags. The recently removed PF_OOM_ORIGIN flag is reused as PF_NO_MEMCG_OOM, which signals that the memcg OOM killer could lead to a deadlock. Only locked callers of __generic_file_aio_write are currently marked. 
I am pretty sure there are more places (I didn't check shmem, hugetlb uses a fancy instantiation mutex during page fault, and filesystems might take their own locks during the write), but I've ignored those as this will probably remain a user-specific patch without any way to get upstream in its current form. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- drivers/staging/pohmelfs/inode.c | 2 ++ include/linux/sched.h | 1 + mm/filemap.c | 2 ++ mm/memcontrol.c | 18 ++++++++++++++---- 4 files changed, 19 insertions(+), 4 deletions(-) diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c index 7a19555..523de82e 100644 --- a/drivers/staging/pohmelfs/inode.c +++ b/drivers/staging/pohmelfs/inode.c @@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, if (ret) goto err_out_unlock; + current->flags |= PF_NO_MEMCG_OOM; ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); + current->flags &= ~PF_NO_MEMCG_OOM; *ppos = kiocb.ki_pos; mutex_unlock(&inode->i_mutex); diff --git a/include/linux/sched.h b/include/linux/sched.h index 1e86bb4..f275c8f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ #define PF_KSWAPD 0x00040000 /* I am kswapd */ +#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ diff --git a/mm/filemap.c b/mm/filemap.c index 556858c..58a316b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, 
mutex_lock(&inode->i_mutex); blk_start_plug(&plug); + current->flags |= PF_NO_MEMCG_OOM; ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); + current->flags &= ~PF_NO_MEMCG_OOM; mutex_unlock(&inode->i_mutex); if (ret > 0 || ret == -EIOCBQUEUED) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..128b615 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,14 @@ done: return 0; nomem: *ptr = NULL; + if (printk_ratelimit()) + printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." + " If this message shows up very often for the" + " same task then there is a risk that the" + " process is not able to make any progress" + " because of the current limit. Try to enlarge" + " the hard limit.\n", __FUNCTION__, + current->comm, current->pid, memcg); return -ENOMEM; bypass: *ptr = NULL; @@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(current->flags & PF_NO_MEMCG_OOM); int ret; if (PageTransHuge(page)) { @@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(current->flags & PF_NO_MEMCG_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(current->flags & PF_NO_MEMCG_OOM); struct mem_cgroup *memcg; int ret; @@ -2840,13 +2850,13 @@ int 
mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set @ 2013-02-06 16:00 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-06 16:00 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 15:22:19, Michal Hocko wrote: > On Wed 06-02-13 15:01:19, Michal Hocko wrote: > > On Wed 06-02-13 02:17:21, azurIt wrote: > > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > > >mentioned in a follow up email. Here is the full patch: > > > > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > > http://www.watchdog.sk/lkml/oom_mysqld6 > > > > [...] > > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > > Hardware name: S5000VSA > > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > ---[ end trace 8817670349022007 ]--- > > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > > apache2 cpuset=uid mems_allowed=0 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > > > The first trace comes from the debugging WARN and it clearly points to > > a file fault path. __do_fault pre-charges a page in case we need to > > do CoW (copy-on-write) for the returned page. This one falls back to > > memcg OOM and never returns ENOMEM as I have mentioned earlier. > > However, the fs fault handler (filemap_fault here) can fallback to > > page_cache_read if the readahead (do_sync_mmap_readahead) fails > > to get page to the page cache. And we can see this happening in > > the first trace. page_cache_read then calls add_to_page_cache_lru > > and eventually gets to add_to_page_cache_locked which calls > > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > > happen. This ENOMEM gets to the fault handler and kaboom. > > > > So the fix is really much more complex than I thought. Although > > add_to_page_cache_locked sounded like a good place it turned out to be > > not in fact. > > > > We need something more clever appaerently. One way would be not misusing > > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > > bits for those flags in gfp_t so there should be some room there. 
> > Or we could do this per task flag, same we do for NO_IO in the current > > -mm tree. > > The later one seems easier wrt. gfp_mask passing horror - e.g. > > __generic_file_aio_write doesn't pass flags and it can be called from > > unlocked contexts as well. > > Ouch, PF_ flags space seem to be drained already because > task_struct::flags is just unsigned int so there is just one bit left. I > am not sure this is the best use for it. This will be a real pain! OK, so this something that should help you without any risk of false OOMs. I do not believe that something like that would be accepted upstream because it is really heavy. We will need to come up with something more clever for upstream. I have also added a warning which will trigger when the charge fails. If you see too many of those messages then there is something bad going on and the lack of OOM causes userspace to loop without getting any progress. So there you go - your personal patch ;) You can drop all other patches. Please note I have just compile tested it. But it should be pretty trivial to check it is correct --- From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Wed, 6 Feb 2013 16:45:07 +0100 Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). 
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though because administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from dangerous context. Memcg charging code has no way to find out whether it is called from a locked context we have to help it via process flags. PF_OOM_ORIGIN flag removed recently will be reused for PF_NO_MEMCG_OOM which signals that the memcg OOM killer could lead to a deadlock. Only locked callers of __generic_file_aio_write are currently marked. 
I am pretty sure there are more places (I didn't check shmem and hugetlb uses fancy instantion mutex during page fault and filesystems might use some locks during the write) but I've ignored those as this will probably be just a user specific patch without any way to get upstream in the current form. Reported-by: azurIt <azurit-Rm0zKEqwvD4@public.gmane.org> Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- drivers/staging/pohmelfs/inode.c | 2 ++ include/linux/sched.h | 1 + mm/filemap.c | 2 ++ mm/memcontrol.c | 18 ++++++++++++++---- 4 files changed, 19 insertions(+), 4 deletions(-) diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c index 7a19555..523de82e 100644 --- a/drivers/staging/pohmelfs/inode.c +++ b/drivers/staging/pohmelfs/inode.c @@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, if (ret) goto err_out_unlock; + current->flags |= PF_NO_MEMCG_OOM; ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); + current->flags &= ~PF_NO_MEMCG_OOM; *ppos = kiocb.ki_pos; mutex_unlock(&inode->i_mutex); diff --git a/include/linux/sched.h b/include/linux/sched.h index 1e86bb4..f275c8f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ #define PF_KSWAPD 0x00040000 /* I am kswapd */ +#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ diff --git a/mm/filemap.c b/mm/filemap.c index 556858c..58a316b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2617,7 +2617,9 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const 
struct iovec *iov, mutex_lock(&inode->i_mutex); blk_start_plug(&plug); + current->flags |= PF_NO_MEMCG_OOM; ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); + current->flags &= ~PF_NO_MEMCG_OOM; mutex_unlock(&inode->i_mutex); if (ret > 0 || ret == -EIOCBQUEUED) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..128b615 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,14 @@ done: return 0; nomem: *ptr = NULL; + if (printk_ratelimit()) + printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." + " If this message shows up very often for the" + " same task then there is a risk that the" + " process is not able to make any progress" + " because of the current limit. Try to enlarge" + " the hard limit.\n", __FUNCTION__, + current->comm, current->pid, memcg); return -ENOMEM; bypass: *ptr = NULL; @@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; - bool oom = true; + bool oom = !(current->flags & PF_NO_MEMCG_OOM); int ret; if (PageTransHuge(page)) { @@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) { + bool oom = !(current->flags & PF_NO_MEMCG_OOM); struct mem_cgroup *memcg = NULL; int ret; @@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, mm = &init_mm; if (page_is_file_cache(page)) { - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); if (ret || !memcg) return ret; @@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { + bool oom = !(current->flags & PF_NO_MEMCG_OOM); struct mem_cgroup *memcg; int ret; @@ -2840,13 +2850,13 @@ int 
mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *ptr = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); css_put(&memcg->css); return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; - return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); + return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); } static void -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set @ 2013-02-06 16:00 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-06 16:00 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Wed 06-02-13 15:22:19, Michal Hocko wrote: > On Wed 06-02-13 15:01:19, Michal Hocko wrote: > > On Wed 06-02-13 02:17:21, azurIt wrote: > > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > > >mentioned in a follow up email. Here is the full patch: > > > > > > > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > > > http://www.watchdog.sk/lkml/oom_mysqld6 > > > > [...] > > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > > Hardware name: S5000VSA > > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > ---[ end trace 8817670349022007 ]--- > > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > > apache2 cpuset=uid mems_allowed=0 > > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > > Call Trace: > > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > > > The first trace comes from the debugging WARN and it clearly points to > > a file fault path. __do_fault pre-charges a page in case we need to > > do CoW (copy-on-write) for the returned page. This one falls back to > > memcg OOM and never returns ENOMEM, as I have mentioned earlier. > > However, the fs fault handler (filemap_fault here) can fall back to > > page_cache_read if the readahead (do_sync_mmap_readahead) fails > > to get the page into the page cache. And we can see this happening in > > the first trace. page_cache_read then calls add_to_page_cache_lru > > and eventually gets to add_to_page_cache_locked, which calls > > mem_cgroup_cache_charge_no_oom, so we will get ENOMEM if OOM should > > happen. This ENOMEM gets to the fault handler and kaboom. > > > > So the fix is really much more complex than I thought. Although > > add_to_page_cache_locked sounded like a good place, it turned out not to be. > > > > We apparently need something more clever. One way would be to stop misusing > > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > > bits for those flags in gfp_t so there should be some room there.
> > Or we could do this with a per-task flag, the same as we do for NO_IO in the current > > -mm tree. > > The latter one seems easier wrt. the gfp_mask passing horror - e.g. > > __generic_file_aio_write doesn't pass flags and it can be called from > > unlocked contexts as well. > > Ouch, the PF_ flags space seems to be drained already because > task_struct::flags is just an unsigned int, so there is just one bit left. I > am not sure this is the best use for it. This will be a real pain! OK, so this is something that should help you without any risk of false OOMs. I do not believe that something like that would be accepted upstream because it is really heavy. We will need to come up with something more clever for upstream. I have also added a warning which will trigger when the charge fails. If you see too many of those messages then there is something bad going on and the lack of OOM causes userspace to loop without making any progress. So there you go - your personal patch ;) You can drop all other patches. Please note I have only compile tested it, but it should be pretty trivial to check that it is correct --- ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-06 16:00 ` Michal Hocko (?) @ 2013-02-08 5:03 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-08 5:03 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Michal, thank you very much, but it just didn't work and broke everything :( This happened: The problem started to occur really often immediately after booting the new kernel, every few minutes for one of my users. But everything else seemed to work fine, so I gave it a try for a day (which was a mistake). I grabbed some data for you and went to sleep: http://watchdog.sk/lkml/memcg-bug-4.tar.gz A few hours later I was woken up from my sweet sweet dreams by alert SMSes - Apache wasn't working and our system failed to restart it. When I observed the situation, two apache processes (of that same user as above) were still running and it wasn't possible to kill them in any way. I grabbed some data for you: http://watchdog.sk/lkml/memcg-bug-5.tar.gz Then I logged into the console and this was waiting for me: http://watchdog.sk/lkml/error.jpg Finally I rebooted into a different kernel, wrote this e-mail and went to my lovely bed ;) ______________________________________________________________ > From: "Michal Hocko" <mhocko@suse.cz> > To: azurIt <azurit@pobox.sk> > Date: 06.02.2013 17:00 > Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org> >On Wed 06-02-13 15:22:19, Michal Hocko wrote: >> On Wed 06-02-13 15:01:19, Michal Hocko wrote: >> > On Wed 06-02-13 02:17:21, azurIt wrote: >> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the followup I >> > > >mentioned in a follow-up email.
Here is the full patch: >> > > >> > > >> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> > > http://www.watchdog.sk/lkml/oom_mysqld6 >> > >> > [...] >> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> > Hardware name: S5000VSA >> > gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 >> > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 >> > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 >> > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 >> > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 >> > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 >> > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 >> > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 >> > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 >> > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 >> > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 >> > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 >> > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 >> > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 >> > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 >> > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 >> > [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> > ---[ end trace 8817670349022007 ]--- >> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> > apache2 cpuset=uid mems_allowed=0 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 >> > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 >> > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 >> > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 >> > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 >> > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 >> > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 >> > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 >> > [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> > >> > The first trace comes from the debugging WARN and it clearly points to >> > a file fault path. __do_fault pre-charges a page in case we need to >> > do CoW (copy-on-write) for the returned page. This one falls back to >> > memcg OOM and never returns ENOMEM as I have mentioned earlier. >> > However, the fs fault handler (filemap_fault here) can fallback to >> > page_cache_read if the readahead (do_sync_mmap_readahead) fails >> > to get page to the page cache. And we can see this happening in >> > the first trace. page_cache_read then calls add_to_page_cache_lru >> > and eventually gets to add_to_page_cache_locked which calls >> > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> > happen. This ENOMEM gets to the fault handler and kaboom. >> > >> > So the fix is really much more complex than I thought. Although >> > add_to_page_cache_locked sounded like a good place it turned out to be >> > not in fact. >> > >> > We need something more clever appaerently. One way would be not misusing >> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 >> > bits for those flags in gfp_t so there should be some room there. >> > Or we could do this per task flag, same we do for NO_IO in the current >> > -mm tree. >> > The later one seems easier wrt. gfp_mask passing horror - e.g. >> > __generic_file_aio_write doesn't pass flags and it can be called from >> > unlocked contexts as well. >> >> Ouch, PF_ flags space seem to be drained already because >> task_struct::flags is just unsigned int so there is just one bit left. I >> am not sure this is the best use for it. This will be a real pain! > >OK, so this something that should help you without any risk of false >OOMs. I do not believe that something like that would be accepted >upstream because it is really heavy. We will need to come up with >something more clever for upstream. 
>I have also added a warning which will trigger when the charge fails. If >you see too many of those messages then there is something bad going on >and the lack of OOM causes userspace to loop without making any >progress. > >So there you go - your personal patch ;) You can drop all other patches. >Please note I have only compile tested it, but it should be pretty >trivial to check that it is correct >--- >From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 >From: Michal Hocko <mhocko@suse.cz> >Date: Wed, 6 Feb 2013 16:45:07 +0100 >Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > >The memcg oom killer might deadlock if the process which falls down to >mem_cgroup_handle_oom holds a lock which prevents other tasks from >terminating because they are blocked on the very same lock. >This can happen when a write system call needs to allocate a page but >the allocation hits the memcg hard limit and there is nothing to reclaim >(e.g. there is no swap, or the swap limit is hit as well and all cache pages >have been reclaimed already) and the process selected by the memcg OOM >killer is blocked on i_mutex on the same inode (e.g. truncating it).
> >Process A >[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >[<ffffffff81121c90>] do_last+0x250/0xa30 >[<ffffffff81122547>] path_openat+0xd7/0x440 >[<ffffffff811229c9>] do_filp_open+0x49/0xa0 >[<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >[<ffffffff8110f950>] sys_open+0x20/0x30 >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >[<ffffffffffffffff>] 0xffffffffffffffff > >Process B >[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >[<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >[<ffffffff8111156a>] do_sync_write+0xea/0x130 >[<ffffffff81112183>] vfs_write+0xf3/0x1f0 >[<ffffffff81112381>] sys_write+0x51/0x90 >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >[<ffffffffffffffff>] 0xffffffffffffffff > >This is not a hard deadlock, though, because the administrator can still >intervene and increase the limit on the group, which helps the writer to >finish the allocation and release the lock. > >This patch heals the problem by forbidding OOM from a dangerous context. >The memcg charging code has no way to find out whether it is called from a >locked context, so we have to help it via process flags. The PF_OOM_ORIGIN flag >removed recently will be reused for PF_NO_MEMCG_OOM, which signals that >the memcg OOM killer could lead to a deadlock. >Only locked callers of __generic_file_aio_write are currently marked.
I >am pretty sure there are more places (I didn't check shmem and hugetlb >uses fancy instantion mutex during page fault and filesystems might >use some locks during the write) but I've ignored those as this will >probably be just a user specific patch without any way to get upstream >in the current form. > >Reported-by: azurIt <azurit@pobox.sk> >Signed-off-by: Michal Hocko <mhocko@suse.cz> >--- > drivers/staging/pohmelfs/inode.c | 2 ++ > include/linux/sched.h | 1 + > mm/filemap.c | 2 ++ > mm/memcontrol.c | 18 ++++++++++++++---- > 4 files changed, 19 insertions(+), 4 deletions(-) > >diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c >index 7a19555..523de82e 100644 >--- a/drivers/staging/pohmelfs/inode.c >+++ b/drivers/staging/pohmelfs/inode.c >@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, > if (ret) > goto err_out_unlock; > >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > *ppos = kiocb.ki_pos; > > mutex_unlock(&inode->i_mutex); >diff --git a/include/linux/sched.h b/include/linux/sched.h >index 1e86bb4..f275c8f 100644 >--- a/include/linux/sched.h >+++ b/include/linux/sched.h >@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define PF_FROZEN 0x00010000 /* frozen for system suspend */ > #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ > #define PF_KSWAPD 0x00040000 /* I am kswapd */ >+#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ > #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ >diff --git a/mm/filemap.c b/mm/filemap.c >index 556858c..58a316b 100644 >--- a/mm/filemap.c >+++ b/mm/filemap.c >@@ -2617,7 +2617,9 @@ ssize_t 
generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > > mutex_lock(&inode->i_mutex); > blk_start_plug(&plug); >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > mutex_unlock(&inode->i_mutex); > > if (ret > 0 || ret == -EIOCBQUEUED) { >diff --git a/mm/memcontrol.c b/mm/memcontrol.c >index c8425b1..128b615 100644 >--- a/mm/memcontrol.c >+++ b/mm/memcontrol.c >@@ -2397,6 +2397,14 @@ done: > return 0; > nomem: > *ptr = NULL; >+ if (printk_ratelimit()) >+ printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." >+ " If this message shows up very often for the" >+ " same task then there is a risk that the" >+ " process is not able to make any progress" >+ " because of the current limit. Try to enlarge" >+ " the hard limit.\n", __FUNCTION__, >+ current->comm, current->pid, memcg); > return -ENOMEM; > bypass: > *ptr = NULL; >@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > struct page_cgroup *pc; >- bool oom = true; >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > int ret; > > if (PageTransHuge(page)) { >@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg = NULL; > int ret; > >@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > mm = &init_mm; > > if (page_is_file_cache(page)) { >- ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); >+ ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); > if (ret || !memcg) > return ret; > >@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, struct 
mem_cgroup **ptr) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg; > int ret; > >@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *ptr = memcg; >- ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); >+ ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); > css_put(&memcg->css); > return ret; > charge_cur_mm: > if (unlikely(!mm)) > mm = &init_mm; >- return __mem_cgroup_try_charge(mm, mask, 1, ptr, true); >+ return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); > } > > static void >-- >1.7.10.4 > >-- >Michal Hocko >SUSE Labs > ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set @ 2013-02-08 5:03 ` azurIt 0 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-08 5:03 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Michal, thank you very much, but it just didn't work and broke everything :( This is what happened: the problem started to occur really often immediately after booting the new kernel, every few minutes for one of my users. But everything else seemed to work fine, so I gave it a try for a day (which was a mistake). I grabbed some data for you and went to sleep: http://watchdog.sk/lkml/memcg-bug-4.tar.gz A few hours later I was woken from my sweet sweet dreams by alert SMSes - Apache wasn't working and our system failed to restart it. When I observed the situation, two apache processes (of that same user as above) were still running and it wasn't possible to kill them in any way. I grabbed some data for you: http://watchdog.sk/lkml/memcg-bug-5.tar.gz Then I logged in to the console and this was waiting for me: http://watchdog.sk/lkml/error.jpg Finally I rebooted into a different kernel, wrote this e-mail and went to my lovely bed ;) ______________________________________________________________ > From: "Michal Hocko" <mhocko@suse.cz> > To: azurIt <azurit@pobox.sk> > Date: 06.02.2013 17:00 > Subject: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org> >On Wed 06-02-13 15:22:19, Michal Hocko wrote: >> On Wed 06-02-13 15:01:19, Michal Hocko wrote: >> > On Wed 06-02-13 02:17:21, azurIt wrote: >> > > >5-memcg-fix-1.patch is not complete. It doesn't contain the followup I >> > > >mentioned in a follow-up email. 
Here is the full patch: >> > > >> > > >> > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> > > http://www.watchdog.sk/lkml/oom_mysqld6 >> > >> > [...] >> > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> > Hardware name: S5000VSA >> > gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 >> > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 >> > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 >> > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 >> > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 >> > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 >> > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 >> > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 >> > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 >> > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 >> > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 >> > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 >> > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 >> > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 >> > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 >> > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 >> > [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> > ---[ end trace 8817670349022007 ]--- >> > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> > apache2 cpuset=uid mems_allowed=0 >> > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> > Call Trace: >> > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 >> > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 >> > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 >> > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 >> > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 >> > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 >> > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 >> > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 >> > [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> > >> > The first trace comes from the debugging WARN and it clearly points to >> > a file fault path. __do_fault pre-charges a page in case we need to >> > do CoW (copy-on-write) for the returned page. This one falls back to >> > memcg OOM and never returns ENOMEM as I have mentioned earlier. >> > However, the fs fault handler (filemap_fault here) can fall back to >> > page_cache_read if the readahead (do_sync_mmap_readahead) fails >> > to get the page into the page cache. And we can see this happening in >> > the first trace. page_cache_read then calls add_to_page_cache_lru >> > and eventually gets to add_to_page_cache_locked, which calls >> > mem_cgroup_cache_charge_no_oom, so we will get ENOMEM if oom should >> > happen. This ENOMEM gets to the fault handler and kaboom. >> > >> > So the fix is really much more complex than I thought. Although >> > add_to_page_cache_locked sounded like a good place, it turned out >> > not to be one in fact. >> > >> > We apparently need something more clever. One way would be to stop misusing >> > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 >> > bits for those flags in gfp_t so there should be some room there. >> > Or we could do this as a per-task flag, the same as we do for NO_IO in the current >> > -mm tree. >> > The latter seems easier wrt. the gfp_mask passing horror - e.g. >> > __generic_file_aio_write doesn't pass flags and it can be called from >> > unlocked contexts as well. >> >> Ouch, the PF_ flags space seems to be drained already because >> task_struct::flags is just unsigned int so there is just one bit left. I >> am not sure this is the best use for it. This will be a real pain! > >OK, so this is something that should help you without any risk of false >OOMs. I do not believe that something like this would be accepted >upstream because it is really heavy. We will need to come up with >something more clever for upstream. 
>I have also added a warning which will trigger when the charge fails. If >you see too many of those messages then there is something bad going on >and the lack of OOM causes userspace to loop without making any >progress. > >So there you go - your personal patch ;) You can drop all other patches. >Please note I have just compile tested it, but it should be pretty >trivial to check that it is correct. >--- >From 6f155187f77c971b45caf05dbc80ca9c20bc278c Mon Sep 17 00:00:00 2001 >From: Michal Hocko <mhocko@suse.cz> >Date: Wed, 6 Feb 2013 16:45:07 +0100 >Subject: [PATCH 1/2] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > >The memcg oom killer might deadlock if the process which falls down to >mem_cgroup_handle_oom holds a lock which prevents another task from >terminating because that task is blocked on the very same lock. >This can happen when a write system call needs to allocate a page but >the allocation hits the memcg hard limit and there is nothing to reclaim >(e.g. there is no swap, or the swap limit is hit as well and all cache pages >have been reclaimed already) and the process selected by the memcg OOM >killer is blocked on i_mutex on the same inode (e.g. while truncating it). 
> >Process A >[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex >[<ffffffff81121c90>] do_last+0x250/0xa30 >[<ffffffff81122547>] path_openat+0xd7/0x440 >[<ffffffff811229c9>] do_filp_open+0x49/0xa0 >[<ffffffff8110f7d6>] do_sys_open+0x106/0x240 >[<ffffffff8110f950>] sys_open+0x20/0x30 >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >[<ffffffffffffffff>] 0xffffffffffffffff > >Process B >[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 >[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 >[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 >[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 >[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 >[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 >[<ffffffff81193a18>] ext3_write_begin+0x88/0x270 >[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 >[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 >[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex >[<ffffffff8111156a>] do_sync_write+0xea/0x130 >[<ffffffff81112183>] vfs_write+0xf3/0x1f0 >[<ffffffff81112381>] sys_write+0x51/0x90 >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d >[<ffffffffffffffff>] 0xffffffffffffffff > >This is not a hard deadlock, though, because the administrator can still >intervene and increase the limit on the group, which helps the writer to >finish the allocation and release the lock. > >This patch heals the problem by forbidding OOM from dangerous contexts. >The memcg charging code has no way to find out whether it is called from a >locked context, so we have to help it via process flags. The recently removed >PF_OOM_ORIGIN flag will be reused for PF_NO_MEMCG_OOM, which signals that >the memcg OOM killer could lead to a deadlock. >Only locked callers of __generic_file_aio_write are currently marked. 
I >am pretty sure there are more places (I didn't check shmem, and hugetlb >uses a fancy instantiation mutex during page faults, and filesystems might >use some locks during the write) but I've ignored those as this will >probably be just a user-specific patch without any way to get upstream >in the current form. > >Reported-by: azurIt <azurit@pobox.sk> >Signed-off-by: Michal Hocko <mhocko@suse.cz> >--- > drivers/staging/pohmelfs/inode.c | 2 ++ > include/linux/sched.h | 1 + > mm/filemap.c | 2 ++ > mm/memcontrol.c | 18 ++++++++++++++---- > 4 files changed, 19 insertions(+), 4 deletions(-) > >diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c >index 7a19555..523de82e 100644 >--- a/drivers/staging/pohmelfs/inode.c >+++ b/drivers/staging/pohmelfs/inode.c >@@ -921,7 +921,9 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf, > if (ret) > goto err_out_unlock; > >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > *ppos = kiocb.ki_pos; > > mutex_unlock(&inode->i_mutex); >diff --git a/include/linux/sched.h b/include/linux/sched.h >index 1e86bb4..f275c8f 100644 >--- a/include/linux/sched.h >+++ b/include/linux/sched.h >@@ -1781,6 +1781,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * > #define PF_FROZEN 0x00010000 /* frozen for system suspend */ > #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ > #define PF_KSWAPD 0x00040000 /* I am kswapd */ >+#define PF_NO_MEMCG_OOM 0x00080000 /* Memcg OOM could lead to a deadlock */ > #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */ >diff --git a/mm/filemap.c b/mm/filemap.c >index 556858c..58a316b 100644 >--- a/mm/filemap.c >+++ b/mm/filemap.c >@@ -2617,7 +2617,9 @@ ssize_t 
generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, > > mutex_lock(&inode->i_mutex); > blk_start_plug(&plug); >+ current->flags |= PF_NO_MEMCG_OOM; > ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); >+ current->flags &= ~PF_NO_MEMCG_OOM; > mutex_unlock(&inode->i_mutex); > > if (ret > 0 || ret == -EIOCBQUEUED) { >diff --git a/mm/memcontrol.c b/mm/memcontrol.c >index c8425b1..128b615 100644 >--- a/mm/memcontrol.c >+++ b/mm/memcontrol.c >@@ -2397,6 +2397,14 @@ done: > return 0; > nomem: > *ptr = NULL; >+ if (printk_ratelimit()) >+ printk(KERN_WARNING"%s: task:%s pid:%d got ENOMEM without OOM for memcg:%p." >+ " If this message shows up very often for the" >+ " same task then there is a risk that the" >+ " process is not able to make any progress" >+ " because of the current limit. Try to enlarge" >+ " the hard limit.\n", __FUNCTION__, >+ current->comm, current->pid, memcg); > return -ENOMEM; > bypass: > *ptr = NULL; >@@ -2703,7 +2711,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > struct page_cgroup *pc; >- bool oom = true; >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > int ret; > > if (PageTransHuge(page)) { >@@ -2770,6 +2778,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg, > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg = NULL; > int ret; > >@@ -2782,7 +2791,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > mm = &init_mm; > > if (page_is_file_cache(page)) { >- ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true); >+ ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom); > if (ret || !memcg) > return ret; > >@@ -2818,6 +2827,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, struct 
mem_cgroup **ptr) > { >+ bool oom = !(current->flags & PF_NO_MEMCG_OOM); > struct mem_cgroup *memcg; > int ret; > >@@ -2840,13 +2850,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *ptr = memcg; >- ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true); >+ ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom); > css_put(&memcg->css); > return ret; > charge_cur_mm: > if (unlikely(!mm)) > mm = &init_mm; >- return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom); > } > > static void >-- >1.7.10.4 > >-- >Michal Hocko >SUSE Labs > ^ permalink raw reply [flat|nested] 444+ messages in thread
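A side note on the flag handling in the patch quoted above: generic_file_aio_write clears PF_NO_MEMCG_OOM unconditionally on the way out, which would also clear a bit set by an outer caller if such sections ever nested. A nesting-safe variant remembers whether the bit was already set and restores that state. The sketch below is a userspace analogue of that pattern (the helper names and the global stand-in for current->flags are hypothetical, not kernel API):

```c
#include <assert.h>

#define PF_NO_MEMCG_OOM 0x00080000u

/* Stand-in for current->flags in this userspace sketch. */
static unsigned int task_flags;

/* Set the bit and report whether it was already set. */
static unsigned int memcg_oom_disable(void)
{
	unsigned int old = task_flags & PF_NO_MEMCG_OOM;

	task_flags |= PF_NO_MEMCG_OOM;
	return old;
}

/* Clear the bit only if it was clear before the matching disable. */
static void memcg_oom_restore(unsigned int old)
{
	if (!old)
		task_flags &= ~PF_NO_MEMCG_OOM;
}
```

With this save/restore pairing, an inner disable/restore leaves an outer caller's bit intact, which an unconditional clear does not.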
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 5:03 ` azurIt @ 2013-02-08 9:44 ` Michal Hocko 1 sibling, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-08 9:44 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 06:03:04, azurIt wrote: > Michal, thank you very much but it just didn't work and broke > everything :( I am sorry to hear that. The patch should help to solve the deadlock you have seen earlier. It can in no way solve the side effects of failing writes, and it also cannot help much if the OOM is permanent. > This happened: > Problem started to occur really often immediately after booting the > new kernel, every few minutes for one of my users. But everything > other seems to work fine so i gave it a try for a day (which was a > mistake). I grabbed some data for you and go to sleep: > http://watchdog.sk/lkml/memcg-bug-4.tar.gz Do you have logs from that time period? I have only glanced through the stacks; most of the threads are waiting in mem_cgroup_handle_oom (mostly from the page fault path, where we have no option other than waiting), which suggests that your memory limit is seriously underestimated. If you look at the number of charging failures (the memory.failcnt per-group file) then you will see 9332083 failures on _average_ per group. This is a lot! Not all those failures end with OOM, of course. But it clearly signals that the workload needs much more memory than the limit allows. > Few hours later i was woke up from my sweet sweet dreams by alerts > smses - Apache wasn't working and our system failed to restart > it. When i observed the situation, two apache processes (of that user > as above) were still running and it wasn't possible to kill them by > any way. 
I grabbed some data for you: > http://watchdog.sk/lkml/memcg-bug-5.tar.gz There are only 5 groups in this one and all of them have no memory charged (so no OOM going on). All tasks are somewhere in the ptrace code. grep cache -r . ./1360297489/memory.stat:cache 0 ./1360297489/memory.stat:total_cache 65642496 ./1360297491/memory.stat:cache 0 ./1360297491/memory.stat:total_cache 65642496 ./1360297492/memory.stat:cache 0 ./1360297492/memory.stat:total_cache 65642496 ./1360297490/memory.stat:cache 0 ./1360297490/memory.stat:total_cache 65642496 ./1360297488/memory.stat:cache 0 ./1360297488/memory.stat:total_cache 65642496 which suggests that this is a parent group and the memory is charged in a child group. I guess that all of those are under OOM, as the numbers suggest they have a limit of 62M. > Then I logged to the console and this was waiting for me: > http://watchdog.sk/lkml/error.jpg This is just a warning and it should be harmless. There is just one WARN in ptrace_check_attach: WARN_ON_ONCE(task_is_stopped(child)) This has been introduced by http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=321fb561 and the commit description claims this shouldn't happen. I am not familiar with this code but it sounds like a bug in the tracing code which is not related to the discussed issue. > Finally i rebooted into different kernel, wrote this e-mail and go to > my lovely bed ;) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
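Michal's per-group failure-count observation above can be reproduced with a small script. This is a sketch under the assumption of a cgroup-v1 memory hierarchy; the default mount path is an assumption, so pass your own root if it differs:

```shell
#!/usr/bin/env bash
# Average memory.failcnt over every group found under a memcg
# hierarchy root (hypothetical helper, not part of any tool).
avg_failcnt() {
    local root="${1:-/sys/fs/cgroup/memory}" total=0 n=0 f
    for f in $(find "$root" -name memory.failcnt 2>/dev/null); do
        total=$((total + $(cat "$f")))
        n=$((n + 1))
    done
    # Print group count and the average only if any group was found.
    [ "$n" -gt 0 ] && echo "groups=$n avg_failcnt=$((total / n))"
}
```

A high and fast-growing average, as in the thread, indicates the limit is well below the working set even when no OOM is triggered.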
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 9:44 ` Michal Hocko @ 2013-02-08 11:02 ` azurIt 1 sibling, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-08 11:02 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner > >Do you have logs from that time period? > >I have only glanced through the stacks and most of the threads are >waiting in the mem_cgroup_handle_oom (mostly from the page fault path >where we do not have other options than waiting) which suggests that >your memory limit is seriously underestimated. If you look at the number >of charging failures (memory.failcnt per-group file) then you will get >9332083 failures in _average_ per group. This is a lot! >Not all those failures end with OOM, of course. But it clearly signals >that the workload need much more memory than the limit allows. What type of logs? I have all of them. Memory usage graph: http://www.watchdog.sk/lkml/memory2.png The new kernel was booted at about 1:15. Data in memcg-bug-4.tar.gz were taken at about 2:35 and data in memcg-bug-5.tar.gz at about 5:25. There was always lots of free memory. The higher memory consumption between 3:39 and 5:33 was caused by a data backup, which completed a few minutes before I restarted the server (this was just a coincidence). >There are only 5 groups in this one and all of them have no memory >charged (so no OOM going on). All tasks are somewhere in the ptrace >code. It's all from the same cgroup but from different times. >grep cache -r . 
>./1360297489/memory.stat:cache 0 >./1360297489/memory.stat:total_cache 65642496 >./1360297491/memory.stat:cache 0 >./1360297491/memory.stat:total_cache 65642496 >./1360297492/memory.stat:cache 0 >./1360297492/memory.stat:total_cache 65642496 >./1360297490/memory.stat:cache 0 >./1360297490/memory.stat:total_cache 65642496 >./1360297488/memory.stat:cache 0 >./1360297488/memory.stat:total_cache 65642496 > >which suggests that this is a parent group and the memory is charged in >a child group. I guess that all those are under OOM as the number seems >like they have limit at 62M. The cgroup has a limit of 330M (346030080 bytes). As I said, these two processes were stuck and it was impossible to kill them. They were perhaps the processes which I was trying to 'strace' before - 'strace' froze, as always when the cgroup has this problem, and I killed it (I was just checking whether it was the original cgroup problem). ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 11:02 ` azurIt (?) @ 2013-02-08 12:38 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-08 12:38 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 12:02:49, azurIt wrote: > > > >Do you have logs from that time period? > > > >I have only glanced through the stacks and most of the threads are > >waiting in the mem_cgroup_handle_oom (mostly from the page fault path > >where we do not have other options than waiting) which suggests that > >your memory limit is seriously underestimated. If you look at the number > >of charging failures (memory.failcnt per-group file) then you will get > >9332083 failures in _average_ per group. This is a lot! > >Not all those failures end with OOM, of course. But it clearly signals > >that the workload need much more memory than the limit allows. > > > What type of logs? I have all. kernel log would be sufficient. > Memory usage graph: > http://www.watchdog.sk/lkml/memory2.png > > New kernel was booted about 1:15. Data in memcg-bug-4.tar.gz were taken about 2:35 and data in memcg-bug-5.tar.gz about 5:25. There was always lots of free memory. Higher memory consumption between 3:39 and 5:33 was caused by data backup and was completed few minutes before i restarted the server (this was just a coincidence). > > > > >There are only 5 groups in this one and all of them have no memory > >charged (so no OOM going on). All tasks are somewhere in the ptrace > >code. > > > It's all from the same cgroup but from different time. > > > > >grep cache -r . 
> >./1360297489/memory.stat:cache 0 > >./1360297489/memory.stat:total_cache 65642496 > >./1360297491/memory.stat:cache 0 > >./1360297491/memory.stat:total_cache 65642496 > >./1360297492/memory.stat:cache 0 > >./1360297492/memory.stat:total_cache 65642496 > >./1360297490/memory.stat:cache 0 > >./1360297490/memory.stat:total_cache 65642496 > >./1360297488/memory.stat:cache 0 > >./1360297488/memory.stat:total_cache 65642496 > > > >which suggests that this is a parent group and the memory is charged in > >a child group. I guess that all those are under OOM as the number seems > >like they have limit at 62M. > > > The cgroup has limit 330M (346030080 bytes). This limit is for top-level groups, right? Those seem to be children which have 62MB charged - is that the limit for those children? > As i said, these two processes Which two processes are those? > were stucked and was impossible to kill them. They were, > maybe, the processes which i was trying to 'strace' before - 'strace' > was freezed as always when the cgroup has this problem and i killed it > (i was just trying if it is the original cgroup problem). I have no idea what strace's role is here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
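Michal's question about which level of the hierarchy the limit applies to comes up often with cgroup v1, since a child with no limit of its own is still constrained by its ancestors. A quick way to see where a limit actually bites is to print limit, usage and failcnt for a group and each of its ancestors. A minimal sketch (cgroup-v1 file names; the mount path and helper name are assumptions):

```shell
#!/usr/bin/env bash
# Walk from a memcg group up to the hierarchy root, printing the
# three files relevant to limit debugging at each level.
walk_memcg() {
    local root="$1" g="$2"
    while :; do
        printf '%s limit=%s usage=%s failcnt=%s\n' "$g" \
            "$(cat "$g/memory.limit_in_bytes")" \
            "$(cat "$g/memory.usage_in_bytes")" \
            "$(cat "$g/memory.failcnt")"
        [ "$g" = "$root" ] && break
        g=$(dirname "$g")
    done
}
```

The level whose usage sits at its limit while its failcnt climbs is the one triggering the memcg OOM path, regardless of which group the tasks nominally live in.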
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 12:38 ` Michal Hocko
@ 2013-02-08 13:56   ` azurIt
  0 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2013-02-08 13:56 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>kernel log would be sufficient.

Full kernel log from the kernel with your newest patch:
http://watchdog.sk/lkml/kern2.log

>This limit is for top level groups, right? Those seem to children which
>have 62MB charged - is that a limit for those children?

It was the limit for the parent cgroup and the processes were in one
(the same) child cgroup. The child cgroup has no memory limit set (so
the limit for the parent was also the limit for the child - 330 MB).

>Which are those two processes?

Data are inside memcg-bug-5.tar.gz in directories bug/<timestamp>/<pids>/

>I have no idea what is the strace role here.

I was stracing exactly two processes from that cgroup, and exactly two
processes were stuck later and impossible to kill. Both of them were
waiting on 'ptrace_stop'. Maybe it's completely unrelated, just
guessing.

azur

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 13:56 ` azurIt
@ 2013-02-08 14:47   ` Michal Hocko
  0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2013-02-08 14:47 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 14:56:16, azurIt wrote:
> Data are inside memcg-bug-5.tar.gz in directories bug/<timestamp>/<pids>/

ohh, I didn't get that those were timestamp directories. It makes more
sense now.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 13:56 ` azurIt
@ 2013-02-08 15:24   ` Michal Hocko
  0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2013-02-08 15:24 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 14:56:16, azurIt wrote:
> >kernel log would be sufficient.
>
> Full kernel log from the kernel with your newest patch:
> http://watchdog.sk/lkml/kern2.log

OK, so the log says that there is a little slaughter on your yard:
$ grep "Memory cgroup out of memory:" kern2.log | wc -l
220
$ grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@' | sort -u | wc -l
220

Which means that the oom killer didn't try to kill any task more than
once, which is good because it tells us that the killed task manages to
die before we trigger oom again. So this is definitely not a deadlock.
You are just hitting OOM very often.

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1091/uid killed as a result of limit of /1091
      1 Task in /1223/uid killed as a result of limit of /1223
      1 Task in /1229/uid killed as a result of limit of /1229
      1 Task in /1255/uid killed as a result of limit of /1255
      1 Task in /1424/uid killed as a result of limit of /1424
      1 Task in /1470/uid killed as a result of limit of /1470
      1 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1080/uid killed as a result of limit of /1080
      3 Task in /1381/uid killed as a result of limit of /1381
      4 Task in /1185/uid killed as a result of limit of /1185
      4 Task in /1289/uid killed as a result of limit of /1289
      4 Task in /1709/uid killed as a result of limit of /1709
      5 Task in /1279/uid killed as a result of limit of /1279
      6 Task in /1020/uid killed as a result of limit of /1020
      6 Task in /1527/uid killed as a result of limit of /1527
      9 Task in /1388/uid killed as a result of limit of /1388
     17 Task in /1281/uid killed as a result of limit of /1281
     22 Task in /1599/uid killed as a result of limit of /1599
     30 Task in /1155/uid killed as a result of limit of /1155
     31 Task in /1258/uid killed as a result of limit of /1258
     71 Task in /1293/uid killed as a result of limit of /1293

So the group 1293 suffers the most. I would check how much memory the
workload in the group really needs, because this level of OOM cannot
possibly be healthy.

The log also says that the deadlock prevention implemented by the patch
triggered and some writes really failed due to potential OOM:
$ grep "If this message shows up" kern2.log
Feb  8 01:17:10 server01 kernel: [  431.033593] __mem_cgroup_try_charge: task:apache2 pid:6733 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.556782] __mem_cgroup_try_charge: task:apache2 pid:12092 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:22:52 server01 kernel: [  773.567916] __mem_cgroup_try_charge: task:apache2 pid:12093 got ENOMEM without OOM for memcg:ffff8803807d5600. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 01:29:00 server01 kernel: [ 1141.355693] __mem_cgroup_try_charge: task:apache2 pid:17734 got ENOMEM without OOM for memcg:ffff88036e956e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.
Feb  8 03:30:39 server01 kernel: [ 8440.346811] __mem_cgroup_try_charge: task:apache2 pid:8687 got ENOMEM without OOM for memcg:ffff8803654d6e00. If this message shows up very often for the same task then there is a risk that the process is not able to make any progress because of the current limit. Try to enlarge the hard limit.

This doesn't look very unhealthy. I would have expected writes to fail
more often, but it seems that the biggest memory pressure comes from
mmaps and page faults, which have no way out other than OOM. So my
suggestion would be to reconsider the limits for the groups to provide a
more realistic environment.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
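The per-group kill tally above is plain log surgery and can be reproduced anywhere: grep isolates the memcg OOM lines, the sed strips everything up to the final "] " of the syslog prefix, and uniq -c / sort -k1 -n ranks the groups. A sketch of the same pipeline run over a few fabricated log lines (hostname, timestamps and group ids are made up):

```shell
# Fabricated syslog excerpt in the same shape as the kernel messages above.
cat > sample.log <<'EOF'
Feb  8 01:00:00 host kernel: [  100.000000] Task in /1258/uid killed as a result of limit of /1258
Feb  8 01:00:01 host kernel: [  101.000000] Task in /1258/uid killed as a result of limit of /1258
Feb  8 01:00:02 host kernel: [  102.000000] Task in /1293/uid killed as a result of limit of /1293
EOF

# Count memcg OOM kills per group, least-hit group first.
# 's@.*\] @@' greedily removes everything through the last "] ",
# i.e. the syslog timestamp/host prefix.
grep "killed as a result of limit" sample.log \
  | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
```

Against the fabricated excerpt this prints /1293 once and /1258 twice, the toy equivalent of the 71-kill group 1293 in the real log.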
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-08 15:24 ` Michal Hocko
@ 2013-02-08 15:58   ` azurIt
  0 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2013-02-08 15:58 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>Which means that the oom killer didn't try to kill any task more than
>once which is good because it tells us that the killed task manages to
>die before we trigger oom again. So this is definitely not a deadlock.
>You are just hitting OOM very often.
>
>$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
>      1 Task in /1091/uid killed as a result of limit of /1091
>      1 Task in /1223/uid killed as a result of limit of /1223
>      1 Task in /1229/uid killed as a result of limit of /1229
>      1 Task in /1255/uid killed as a result of limit of /1255
>      1 Task in /1424/uid killed as a result of limit of /1424
>      1 Task in /1470/uid killed as a result of limit of /1470
>      1 Task in /1567/uid killed as a result of limit of /1567
>      2 Task in /1080/uid killed as a result of limit of /1080
>      3 Task in /1381/uid killed as a result of limit of /1381
>      4 Task in /1185/uid killed as a result of limit of /1185
>      4 Task in /1289/uid killed as a result of limit of /1289
>      4 Task in /1709/uid killed as a result of limit of /1709
>      5 Task in /1279/uid killed as a result of limit of /1279
>      6 Task in /1020/uid killed as a result of limit of /1020
>      6 Task in /1527/uid killed as a result of limit of /1527
>      9 Task in /1388/uid killed as a result of limit of /1388
>     17 Task in /1281/uid killed as a result of limit of /1281
>     22 Task in /1599/uid killed as a result of limit of /1599
>     30 Task in /1155/uid killed as a result of limit of /1155
>     31 Task in /1258/uid killed as a result of limit of /1258
>     71 Task in /1293/uid killed as a result of limit of /1293
>
>So the group 1293 suffers the most. I would check how much memory the
>workload in the group really needs because this level of OOM cannot
>possibly be healthy.

I took the kernel log from yesterday from the same time frame:

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1252/uid killed as a result of limit of /1252
      1 Task in /1709/uid killed as a result of limit of /1709
      2 Task in /1185/uid killed as a result of limit of /1185
      2 Task in /1388/uid killed as a result of limit of /1388
      2 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1650/uid killed as a result of limit of /1650
      3 Task in /1527/uid killed as a result of limit of /1527
      5 Task in /1552/uid killed as a result of limit of /1552
   1634 Task in /1258/uid killed as a result of limit of /1258

As you can see, there were many more OOMs in '1258' and no problems like
the ones from this night (well, there were never such problems before :) ).

As i said, cgroup 1258 was freezing every few minutes with your latest
patch, so there must be something wrong (it usually freezes about once
per day). And it was really frozen (i checked that); the symptoms were:
 - cannot strace any of the cgroup's processes
 - no new processes were started, still the same processes were 'running'
 - the kernel was unable to resolve this on its own
 - all processes together were taking 100% of the CPU
 - the whole memory limit was used (see memcg-bug-4.tar.gz for more info)

Unfortunately i forgot to check whether killing only a few of the
processes would resolve it (i always killed them all yesterday night).
I don't know if it was a deadlock or not, but the kernel was definitely
unable to resolve the problem. And there is still the mystery of the two
frozen processes which could not be killed.

By the way, i KNOW that so much OOM is not healthy, but the client
simply doesn't want to buy more memory. He knows about the problem of
the insufficient memory limit.

Thank you.

azur

^ permalink raw reply	[flat|nested] 444+ messages in thread
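Stuck tasks like the ones described above typically show up in a process listing as D (uninterruptible sleep, e.g. a killable wait in mem_cgroup_handle_oom) or t (traced/stopped, e.g. ptrace_stop), with the wait channel naming the kernel function they sleep in. A triage sketch, run here over a canned ps snapshot since the pids and wait channels are purely illustrative (on a live box, pipe `ps -eo pid,stat,wchan:30,comm` into the same filter, and read /proc/&lt;pid&gt;/stack as root for the full trace):

```shell
# Fabricated 'ps -eo pid,stat,wchan:30,comm' output; only the two wedged
# apache2 workers (D and t state) should survive the filter.
cat > ps-snapshot.txt <<'EOF'
  PID STAT WCHAN                          COMMAND
 1234 S    ep_poll                        apache2
 5678 t    ptrace_stop                    apache2
 5679 D    mem_cgroup_handle_oom          apache2
EOF

# Print pid and wait channel of tasks in D (uninterruptible) or
# T/t (stopped/traced) state, skipping the header line.
awk 'NR > 1 && $2 ~ /^[DTt]/ {print $1, $3}' ps-snapshot.txt
```

For the snapshot above this yields the two suspect pids with their wait channels, which is enough to decide whether to go after /proc/&lt;pid&gt;/stack next.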
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set @ 2013-02-08 15:58 ` azurIt 0 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-08 15:58 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Which means that the oom killer didn't try to kill any task more than >once which is good because it tells us that the killed task manages to >die before we trigger oom again. So this is definitely not a deadlock. >You are just hitting OOM very often. >$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1091/uid killed as a result of limit of /1091 > 1 Task in /1223/uid killed as a result of limit of /1223 > 1 Task in /1229/uid killed as a result of limit of /1229 > 1 Task in /1255/uid killed as a result of limit of /1255 > 1 Task in /1424/uid killed as a result of limit of /1424 > 1 Task in /1470/uid killed as a result of limit of /1470 > 1 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1080/uid killed as a result of limit of /1080 > 3 Task in /1381/uid killed as a result of limit of /1381 > 4 Task in /1185/uid killed as a result of limit of /1185 > 4 Task in /1289/uid killed as a result of limit of /1289 > 4 Task in /1709/uid killed as a result of limit of /1709 > 5 Task in /1279/uid killed as a result of limit of /1279 > 6 Task in /1020/uid killed as a result of limit of /1020 > 6 Task in /1527/uid killed as a result of limit of /1527 > 9 Task in /1388/uid killed as a result of limit of /1388 > 17 Task in /1281/uid killed as a result of limit of /1281 > 22 Task in /1599/uid killed as a result of limit of /1599 > 30 Task in /1155/uid killed as a result of limit of /1155 > 31 Task in /1258/uid killed as a result of limit of /1258 > 71 Task in /1293/uid killed as a result of limit of /1293 > >So the group 1293 suffers the most. 
I would check how much memory the >worklod in the group really needs because this level of OOM cannot >possible be healthy. I took the kernel log from yesterday from the same time frame: $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n 1 Task in /1252/uid killed as a result of limit of /1252 1 Task in /1709/uid killed as a result of limit of /1709 2 Task in /1185/uid killed as a result of limit of /1185 2 Task in /1388/uid killed as a result of limit of /1388 2 Task in /1567/uid killed as a result of limit of /1567 2 Task in /1650/uid killed as a result of limit of /1650 3 Task in /1527/uid killed as a result of limit of /1527 5 Task in /1552/uid killed as a result of limit of /1552 1634 Task in /1258/uid killed as a result of limit of /1258 As you can see, there were much more OOM in '1258' and no such problems like this night (well, there were never such problems before :) ). As i said, cgroup 1258 were freezing every few minutes with your latest patch so there must be something wrong (it usually freezes about once per day). And it was really freezed (i checked that), the sypthoms were: - cannot strace any of cgroup processes - no new processes were started, still the same processes were 'running' - kernel was unable to resolve this by it's own - all processes togather were taking 100% CPU - the whole memory limit was used (see memcg-bug-4.tar.gz for more info) Unfortunately i forget to check if killing only few of the processes will resolve it (i always killed them all yesterday night). Don't know if is was in deadlock or not but kernel was definitely unable to resolve the problem. And there is still a mystery of two freezed processes which cannot be killed. By the way, i KNOW that so much OOM is not healthy but the client simply don't want to buy more memory. He knows about the problem of unsufficient memory limit. Thank you. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set @ 2013-02-08 15:58 ` azurIt 0 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-08 15:58 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Which means that the oom killer didn't try to kill any task more than >once which is good because it tells us that the killed task manages to >die before we trigger oom again. So this is definitely not a deadlock. >You are just hitting OOM very often. >$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1091/uid killed as a result of limit of /1091 > 1 Task in /1223/uid killed as a result of limit of /1223 > 1 Task in /1229/uid killed as a result of limit of /1229 > 1 Task in /1255/uid killed as a result of limit of /1255 > 1 Task in /1424/uid killed as a result of limit of /1424 > 1 Task in /1470/uid killed as a result of limit of /1470 > 1 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1080/uid killed as a result of limit of /1080 > 3 Task in /1381/uid killed as a result of limit of /1381 > 4 Task in /1185/uid killed as a result of limit of /1185 > 4 Task in /1289/uid killed as a result of limit of /1289 > 4 Task in /1709/uid killed as a result of limit of /1709 > 5 Task in /1279/uid killed as a result of limit of /1279 > 6 Task in /1020/uid killed as a result of limit of /1020 > 6 Task in /1527/uid killed as a result of limit of /1527 > 9 Task in /1388/uid killed as a result of limit of /1388 > 17 Task in /1281/uid killed as a result of limit of /1281 > 22 Task in /1599/uid killed as a result of limit of /1599 > 30 Task in /1155/uid killed as a result of limit of /1155 > 31 Task in /1258/uid killed as a result of limit of /1258 > 71 Task in /1293/uid killed as a result of limit of /1293 > >So the group 1293 suffers the most. 
I would check how much memory the >worklod in the group really needs because this level of OOM cannot >possible be healthy. I took the kernel log from yesterday from the same time frame: $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n 1 Task in /1252/uid killed as a result of limit of /1252 1 Task in /1709/uid killed as a result of limit of /1709 2 Task in /1185/uid killed as a result of limit of /1185 2 Task in /1388/uid killed as a result of limit of /1388 2 Task in /1567/uid killed as a result of limit of /1567 2 Task in /1650/uid killed as a result of limit of /1650 3 Task in /1527/uid killed as a result of limit of /1527 5 Task in /1552/uid killed as a result of limit of /1552 1634 Task in /1258/uid killed as a result of limit of /1258 As you can see, there were much more OOM in '1258' and no such problems like this night (well, there were never such problems before :) ). As i said, cgroup 1258 were freezing every few minutes with your latest patch so there must be something wrong (it usually freezes about once per day). And it was really freezed (i checked that), the sypthoms were: - cannot strace any of cgroup processes - no new processes were started, still the same processes were 'running' - kernel was unable to resolve this by it's own - all processes togather were taking 100% CPU - the whole memory limit was used (see memcg-bug-4.tar.gz for more info) Unfortunately i forget to check if killing only few of the processes will resolve it (i always killed them all yesterday night). Don't know if is was in deadlock or not but kernel was definitely unable to resolve the problem. And there is still a mystery of two freezed processes which cannot be killed. By the way, i KNOW that so much OOM is not healthy but the client simply don't want to buy more memory. He knows about the problem of unsufficient memory limit. Thank you. 
azur -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 444+ messages in thread
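[Editorial note: the per-task stack snapshots referenced above (memcg-bug-4.tar.gz, later described in the thread as one directory per epoch second with a stack file per pid) could be gathered with a small script along these lines. This is a minimal sketch, not the actual collection script from the thread; the tasks-file path is an assumption, and reading /proc/<pid>/stack normally requires root.]

```shell
# Sketch: snapshot /proc/<pid>/stack for every task listed in a cgroup's
# tasks file into OUT_DIR/<epoch>/<pid>/stack.  PROC_ROOT is overridable
# so the function can be exercised against a fake /proc tree.
snapshot_stacks() {
    tasks=$1; out=$2; proc=${3:-/proc}
    now=$(date +%s)
    while read -r pid; do
        mkdir -p "$out/$now/$pid"
        # a task may exit between reading the tasks file and this read
        cat "$proc/$pid/stack" > "$out/$now/$pid/stack" 2>/dev/null
    done < "$tasks"
}
```

Run in a loop (e.g. once per second) against something like /sys/fs/cgroup/memory/1258/tasks while the group is frozen — that path is hypothetical and depends on where the memory controller is mounted on the affected machine.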
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 15:58 ` azurIt (?) @ 2013-02-08 17:10 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-08 17:10 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 16:58:05, azurIt wrote: [...] > I took the kernel log from yesterday from the same time frame: > > $ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n > 1 Task in /1252/uid killed as a result of limit of /1252 > 1 Task in /1709/uid killed as a result of limit of /1709 > 2 Task in /1185/uid killed as a result of limit of /1185 > 2 Task in /1388/uid killed as a result of limit of /1388 > 2 Task in /1567/uid killed as a result of limit of /1567 > 2 Task in /1650/uid killed as a result of limit of /1650 > 3 Task in /1527/uid killed as a result of limit of /1527 > 5 Task in /1552/uid killed as a result of limit of /1552 > 1634 Task in /1258/uid killed as a result of limit of /1258 > > As you can see, there were many more OOMs in '1258' and no such > problems as this night (well, there were never such problems before > :) ). Well, all the patch does is prevent the deadlock we have seen earlier. Previously the writer would block on the oom wait queue, while now it fails with ENOMEM. The caller sees this as a short write which can be retried (it is a question whether userspace can cope with that properly). All other OOMs are preserved. I suspect that all the problems you are seeing now are just side effects of the OOM conditions. > As i said, cgroup 1258 was freezing every few minutes with your > latest patch so there must be something wrong (it usually freezes > about once per day). And it was really frozen (i checked that), the > symptoms were: I assume you have checked that the killed processes eventually die, right? 
> - cannot strace any of the cgroup processes > - no new processes were started, still the same processes were 'running' > - the kernel was unable to resolve this on its own > - all processes together were taking 100% CPU > - the whole memory limit was used > (see memcg-bug-4.tar.gz for more info) Well, I do not see anything suspicious during that time period (timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 02:36:48). The kernel log shows a lot of oom during that time. All killed processes die eventually. > Unfortunately i forgot to check whether killing only a few of the processes > would resolve it (i always killed them all yesterday night). Don't > know if it was a deadlock or not, but the kernel was definitely unable > to resolve the problem. Nothing shows it would be a deadlock so far. It is quite possible that the userspace went mad when seeing a lot of processes dying because it doesn't expect it. > And there is still the mystery of the two frozen processes which cannot be > killed. > > By the way, i KNOW that so much OOM is not healthy but the client > simply doesn't want to buy more memory. He knows about the problem of > the insufficient memory limit. Well, then you would see a permanent flood of OOM killing, I am afraid. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
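[Editorial note: the timestamp translation mentioned above can be reproduced with GNU date. The snapshot directories are named by epoch seconds; 1360287245, the start of the range quoted later in the thread, decodes as below. Note that the 02:34:05 reading implies the server's clock ran at UTC+1.]

```shell
# Decode an epoch-second snapshot directory name into wall-clock time.
# 02:34:05 local time on the server corresponds to 01:34:05 UTC.
date -u -d @1360287245 '+%Y-%m-%d %H:%M:%S UTC'   # -> 2013-02-08 01:34:05 UTC
```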
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 17:10 ` Michal Hocko @ 2013-02-08 21:02 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-08 21:02 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner > >I assume you have checked that the killed processes eventually die, >right? > When i killed them by hand, yes, they disappeared from the process list (i saw it). I don't know if they really died when OOM killed them. >Well, I do not see anything suspicious during that time period >(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 >02:36:48). The kernel log shows a lot of oom during that time. All >killed processes die eventually. No, they didn't die from OOM when the cgroup was frozen. Just check the PIDs from memcg-bug-4.tar.gz and try to find them in the kernel log. Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no OOM message in the log? Data in memcg-bug-4.tar.gz are only for 2 minutes but i let it run for about 15-20 minutes, and not a single process was killed by OOM. I'm 100% sure that OOM was not killing them (maybe it was trying to but it didn't happen). > >Nothing shows it would be a deadlock so far. It is well possible that >the userspace went mad when seeing a lot of processes dying because it >doesn't expect it. > Lots of processes are dying also now, without your latest patch, and no such things are happening. I'm sure there is something more to this, maybe it revealed another bug? azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-08 21:02 ` azurIt (?) @ 2013-02-10 15:03 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-10 15:03 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 22:02:43, azurIt wrote: > > > >I assume you have checked that the killed processes eventually die, > >right? > > > When i killed them by hand, yes, they disappeared from the process list (i > saw it). I don't know if they really died when OOM killed them. > > > >Well, I do not see anything suspicious during that time period > >(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 > >02:36:48). The kernel log shows a lot of oom during that time. All > >killed processes die eventually. > > > No, they didn't die from OOM when the cgroup was frozen. Just check the PIDs > from memcg-bug-4.tar.gz and try to find them in the kernel log. OK, you seem to be right. My initial examination showed that each cgroup under OOM was able to move forward - in other words it was able to send SIGKILL to somebody and we didn't loop on a single task which cannot die for some reason. Now when looking closer it seems we really have 2 tasks which didn't die after being killed by the OOM killer: $ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; do find bug -name $i; done | sed 's@.*/@@' | sort | uniq -c 141 18211 141 8102 $ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c 141 3b8ce17e82a065a24ee046112033e1e8 So all the stacks are the same: [<ffffffff81069f94>] ptrace_stop+0x114/0x290 [<ffffffff8106a198>] ptrace_do_notify+0x88/0xa0 [<ffffffff8106a203>] ptrace_notify+0x53/0x70 [<ffffffff8100d168>] syscall_trace_enter+0xf8/0x1c0 [<ffffffff815b6983>] tracesys+0x71/0xd7 [<ffffffffffffffff>] 0xffffffffffffffff stuck in the ptrace code. 
The other task is more interesting: $ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c 135 042e893c0e6657ed321ea9045e528f3e 6 dc7e71ce73be2a5c73404b565926e709 All snapshots with 042e893c0e6657ed321ea9045e528f3e are in: [<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110ba83>] T.1149+0x5f3/0x600 [<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0 [<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50 [<ffffffff810ee2a9>] handle_pte_fault+0x609/0x940 [<ffffffff810ee718>] handle_mm_fault+0x138/0x260 [<ffffffff810270bd>] do_page_fault+0x13d/0x460 [<ffffffff815b633f>] page_fault+0x1f/0x30 [<ffffffffffffffff>] 0xffffffffffffffff While the others do not show any stack: cat 1360287257/8102/stack [<ffffffffffffffff>] 0xffffffffffffffff Which is quite interesting because we are talking about snapshots starting at 1360287245 (which maps to 02:34:05) but the kern2.log tells us that this process has been killed much earlier at: Feb 8 01:18:30 server01 kernel: [ 511.139921] Task in /1293/uid killed as a result of limit of /1293 [...] 
Feb 8 01:18:30 server01 kernel: [ 511.229755] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230339] [ 8113] 1293 8113 163756 59442 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230528] [ 8116] 1293 8116 170094 65675 2 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230726] [ 8119] 1293 8119 170094 65675 6 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230924] [ 8123] 1293 8123 169070 64612 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231132] [ 8124] 1293 8124 170094 65675 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231321] [ 8125] 1293 8125 170094 65673 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child This would suggest that the task is hung and cannot be killed but if we have a look at the following OOM in the same group 1293 it was _not_ present in the process list for that group: Feb 8 01:18:33 server01 kernel: [ 514.789550] Task in /1293/uid killed as a result of limit of /1293 [...] 
Feb 8 01:18:33 server01 kernel: [ 514.893198] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Feb 8 01:18:33 server01 kernel: [ 514.893594] [ 8113] 1293 8113 168212 64036 1 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.893786] [ 8116] 1293 8116 170258 65870 6 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.893976] [ 8119] 1293 8119 170258 65870 7 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894166] [ 8123] 1293 8123 170158 65824 6 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894356] [ 8124] 1293 8124 170258 65870 5 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894547] [ 8125] 1293 8125 170158 65824 1 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894749] [ 8149] 1293 8149 163989 59647 7 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child This is all _before_ you started collecting stacks and it also says that 8102 is gone. This all suggests that a) the stack unwinder which displays /proc/<pid>/stack is somehow confused and doesn't show the correct stack for this process, and b) the two processes cannot terminate due to some issue related to ptrace (stracing) the dying process. The above oom list doesn't include any processes which already released the memory, which would explain why you can still see it as a member of the group (when looking into the cgroup/tasks file). My guess would be that there is a bug in ptrace which doesn't free a reference to the task so it cannot go away although it has dropped all the resources already. > Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no > OOM message in the log? I am not sure what you mean here but there are $ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l 16 OOM killer events during the time you were gathering memcg-bug-4 data. > Data in memcg-bug-4.tar.gz are only for 2 > minutes but i let it run for about 15-20 minutes, and not a single process > was killed by OOM. 
I can see $ grep "Memory cgroup out of memory:" kern2.after.log | wc -l 57 killed after 02:38:47 when you stopped gathering data for memcg-bug-4 > I'm 100% sure that OOM was not killing them (maybe it was trying to > but it didn't happen). OK, let's do a little exercise. The list of processes eligible for OOM are listed before any task is killed. So if we collect both pid lists and "Kill process" messages per pid then no entries in the pid list should be present after the specific pid is killed. $ mkdir out $ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'` do grep -e "Memory cgroup out of memory: Kill process $i" \ -e "\[ *\<$i\]" kern2.log > out/$i done $ for i in out/* do tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null|| echo "$i has already killed tasks" done out/6698 has already killed tasks out/6703 has already killed tasks OK, so there are two pids which were listed after they have been killed. Let's have a look at them. 
$ cat out/6698 Feb 8 01:17:04 server01 kernel: [ 425.497924] [ 6698] 1293 6698 170258 65846 1 0 0 apache2 Feb 8 01:17:05 server01 kernel: [ 426.079010] [ 6698] 1293 6698 170258 65846 1 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.144460] [ 6698] 1293 6698 169358 65220 1 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.146058] Memory cgroup out of memory: Kill process 6698 (apache2) score 1000 or sacrifice child Feb 8 03:27:57 server01 kernel: [ 8278.439896] [ 6698] 1020 6698 168518 64219 0 0 0 apache2 Feb 8 03:27:57 server01 kernel: [ 8278.879439] [ 6698] 1020 6698 168518 64218 6 0 0 apache2 Feb 8 03:27:59 server01 kernel: [ 8280.023944] [ 6698] 1020 6698 168816 64540 7 0 0 apache2 Feb 8 03:28:02 server01 kernel: [ 8283.242282] [ 6698] 1020 6698 171953 67751 6 0 0 apache2 $ cat out/6703 Feb 8 01:17:04 server01 kernel: [ 425.498118] [ 6703] 1293 6703 170258 65844 6 0 0 apache2 Feb 8 01:17:05 server01 kernel: [ 426.079206] [ 6703] 1293 6703 170258 65844 6 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.144653] [ 6703] 1293 6703 169358 65219 2 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.258924] [ 6703] 1293 6703 169358 65219 5 0 0 apache2 Feb 8 01:17:10 server01 kernel: [ 431.260282] Memory cgroup out of memory: Kill process 6703 (apache2) score 1000 or sacrifice child Feb 8 03:27:57 server01 kernel: [ 8278.440043] [ 6703] 1020 6703 166286 61978 7 0 0 apache2 Feb 8 03:27:57 server01 kernel: [ 8278.879587] [ 6703] 1020 6703 166286 61977 7 0 0 apache2 Feb 8 03:27:59 server01 kernel: [ 8280.024091] [ 6703] 1020 6703 166484 62233 7 0 0 apache2 Feb 8 03:28:02 server01 kernel: [ 8283.242429] [ 6703] 1020 6703 167402 63118 0 0 0 apache2 Lists have the following columns: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name As we can see the uid changed for both pids after it has been killed (from 1293 to 1020) which suggests that the pid has been reused later for a different user (which is a clear sign that those pids died) - thus different group in your 
setup. So those two died as well, apparently. > >Nothing shows it would be a deadlock so far. It is well possible that > >the userspace went mad when seeing a lot of processes dying because it > >doesn't expect it. > > Lots of processes are dying also now, without your latest patch, and > no such things are happening. I'm sure there is something more to > this, maybe it revealed another bug? So far nothing shows that there would be anything broken wrt. the memcg OOM killer. The ptrace issue sounds strange, all right, but that is another story and worth a separate investigation. I would be interested whether you still see anything wrong going on without that in the game. You can get a pretty nice overview of what is going on wrt. OOM from the log. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
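[Editorial note: the pid-reuse check above can be folded into a single pass over the log. The following is a sketch, not Michal's actual commands; it assumes the task-dump rows keep the "[ pid ] uid tgid ..." layout shown above, and relies on the syslog timestamp bracket containing a dot so it never matches the pid pattern.]

```shell
# Sketch: flag pids that reappear in a memcg OOM task dump under a
# different uid after having been OOM-killed, i.e. the pid was reused by
# another user.  Reads a kern.log-style file on stdin.
detect_pid_reuse() {
    awk '
    /Memory cgroup out of memory: Kill process/ {
        split($0, a, "Kill process "); split(a[2], b, " ")
        killed[b[1]] = 1
        next
    }
    # task-dump rows carry "[ 6698]  1293 ..." after the syslog prefix;
    # the "[  425.497924]" timestamp bracket never matches (it has a dot)
    match($0, /\[ *[0-9]+\]  *[0-9]+ /) {
        t = substr($0, RSTART, RLENGTH)
        sub(/^\[ */, "", t); sub(/\] */, " ", t)
        split(t, f, " ")
        pid = f[1]; uid = f[2]
        if (killed[pid] && (pid in seen) && seen[pid] != uid)
            print "pid " pid " reused: uid " seen[pid] " -> " uid
        seen[pid] = uid
    }'
}
```

Running `detect_pid_reuse < kern2.log` against the log excerpts above would report the uid changes for 6698 and 6703.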
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set @ 2013-02-10 15:03 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-10 15:03 UTC (permalink / raw) To: azurIt Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 08-02-13 22:02:43, azurIt wrote: > > > >I assume you have checked that the killed processes eventually die, > >right? > > > When i killed them by hand, yes, they dissappeard from process list (i > saw it). I don't know if they really died when OOM killed them. > > > >Well, I do not see anything supsicious during that time period > >(timestamps translate between Fri Feb 8 02:34:05 and Fri Feb 8 > >02:36:48). The kernel log shows a lot of oom during that time. All > >killed processes die eventually. > > > No, they didn't died by OOM when cgroup was freezed. Just check PIDs > from memcg-bug-4.tar.gz and try to find them in kernel log. OK, you seem to be right. My initial examination showed that each cgroup under OOM was able to move forward - in other words it was able to send SIGKILL somebody and we didn't loop on a single task which cannot die for some reason. Now when looking closer it seem we really have 2 tasks which didn't die after being killed by OOM killer: $ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; do find bug -name $i; done | sed 's@.*/@@' | sort | uniq -c 141 18211 141 8102 $ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c 141 3b8ce17e82a065a24ee046112033e1e8 So all the stacks are same: [<ffffffff81069f94>] ptrace_stop+0x114/0x290 [<ffffffff8106a198>] ptrace_do_notify+0x88/0xa0 [<ffffffff8106a203>] ptrace_notify+0x53/0x70 [<ffffffff8100d168>] syscall_trace_enter+0xf8/0x1c0 [<ffffffff815b6983>] tracesys+0x71/0xd7 [<ffffffffffffffff>] 0xffffffffffffffff stuck in the ptrace code. 
The other task is more interesting: $ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c 135 042e893c0e6657ed321ea9045e528f3e 6 dc7e71ce73be2a5c73404b565926e709 All snapshots with 042e893c0e6657ed321ea9045e528f3e are in: [<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110ba83>] T.1149+0x5f3/0x600 [<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0 [<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50 [<ffffffff810ee2a9>] handle_pte_fault+0x609/0x940 [<ffffffff810ee718>] handle_mm_fault+0x138/0x260 [<ffffffff810270bd>] do_page_fault+0x13d/0x460 [<ffffffff815b633f>] page_fault+0x1f/0x30 [<ffffffffffffffff>] 0xffffffffffffffff While the others do not show any stack: cat 1360287257/8102/stack [<ffffffffffffffff>] 0xffffffffffffffff Which is quite interesting because we are talking about snapshots starting at 1360287245 (which maps to 02:34:05) but the kern2.log tells us that this process has been killed much earlier at: Feb 8 01:18:30 server01 kernel: [ 511.139921] Task in /1293/uid killed as a result of limit of /1293 [...] 
Feb 8 01:18:30 server01 kernel: [ 511.229755] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230339] [ 8113] 1293 8113 163756 59442 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230528] [ 8116] 1293 8116 170094 65675 2 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230726] [ 8119] 1293 8119 170094 65675 6 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.230924] [ 8123] 1293 8123 169070 64612 7 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231132] [ 8124] 1293 8124 170094 65675 5 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231321] [ 8125] 1293 8125 170094 65673 1 0 0 apache2 Feb 8 01:18:30 server01 kernel: [ 511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child This would suggest that the task is hung and cannot be killed but if we have a look at the following OOM in the same group 1293 it was _not_ present in the process list for that group: Feb 8 01:18:33 server01 kernel: [ 514.789550] Task in /1293/uid killed as a result of limit of /1293 [...] 
Feb 8 01:18:33 server01 kernel: [ 514.893198] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Feb 8 01:18:33 server01 kernel: [ 514.893594] [ 8113] 1293 8113 168212 64036 1 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.893786] [ 8116] 1293 8116 170258 65870 6 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.893976] [ 8119] 1293 8119 170258 65870 7 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894166] [ 8123] 1293 8123 170158 65824 6 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894356] [ 8124] 1293 8124 170258 65870 5 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894547] [ 8125] 1293 8125 170158 65824 1 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894749] [ 8149] 1293 8149 163989 59647 7 0 0 apache2 Feb 8 01:18:33 server01 kernel: [ 514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child This is all _before_ you started collecting stacks and it also says that 8102 is gone. This all suggests that a) stack unwinder which displays /proc/<pid>/stack is somehow confused and it doesn't show the correct stack for this process and b) the two processes cannot terminate due to some issue related to ptrace (stracing) the dying process. The above oom list doesn't include any processes which already released the memory which would explain why you still can see it as a member of the group (when looking into cgroup/tasks file). My guess would be that there is a bug in ptrace which doesn't free a reference to the task so it cannot cannot go away although it has dropped all the resources already. > Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no > OOM message in the log? I am not sure what you mean here but there are $ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l 16 OOM killer events during the time you were gathering memcg-bug-4 data. > Data in memcg-bug-4.tar.gz are only for 2 > minutes but i let it run for about 15-20 minutes, no single process > killed by OOM. 
I can see $ grep "Memory cgroup out of memory:" kern2.after.log | wc -l 57 killed after 02:38:47 when you stopped gathering data for memcg-bug-4 > I'm 100% sure that OOM was not killing them (maybe it was trying to > but it didn't happen). OK, let's do a little exercise. The list of processes eligible for OOM are listed before any task is killed. So if we collect both pid lists and "Kill process" messages per pid then no entries in the pid list should be present after the specific pid is killed. $ mkdir out $ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'` do grep -e "Memory cgroup out of memory: Kill process $i" \ -e "\[ *\<$i\]" kern2.log > out/$i done $ for i in out/* do tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null|| echo "$i has already killed tasks" done out/6698 has already killed tasks out/6703 has already killed tasks OK, so there are two pids which were listed after they have been killed. Let's have a look at them. 
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set @ 2013-02-10 15:03 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2013-02-10 15:03 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Fri 08-02-13 22:02:43, azurIt wrote:
> >I assume you have checked that the killed processes eventually die,
> >right?
>
> When I killed them by hand, yes, they disappeared from the process list
> (I saw it). I don't know if they really died when OOM killed them.
>
> >Well, I do not see anything suspicious during that time period
> >(timestamps translate to between Fri Feb 8 02:34:05 and Fri Feb 8
> >02:36:48). The kernel log shows a lot of OOM activity during that time.
> >All killed processes die eventually.
>
> No, they didn't die by OOM when the cgroup was frozen. Just check PIDs
> from memcg-bug-4.tar.gz and try to find them in the kernel log.

OK, you seem to be right. My initial examination showed that each cgroup under OOM was able to move forward - in other words, it was able to send SIGKILL to somebody and we didn't loop on a single task which cannot die for some reason. Now, looking closer, it seems we really have 2 tasks which didn't die after being killed by the OOM killer:

$ for i in `grep "Memory cgroup out of memory:" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`; do find bug -name $i; done | sed 's@.*/@@' | sort | uniq -c
    141 18211
    141 8102
$ md5sum bug/*/18211/stack | cut -d" " -f1 | uniq -c
    141 3b8ce17e82a065a24ee046112033e1e8

So all the stacks are the same:
[<ffffffff81069f94>] ptrace_stop+0x114/0x290
[<ffffffff8106a198>] ptrace_do_notify+0x88/0xa0
[<ffffffff8106a203>] ptrace_notify+0x53/0x70
[<ffffffff8100d168>] syscall_trace_enter+0xf8/0x1c0
[<ffffffff815b6983>] tracesys+0x71/0xd7
[<ffffffffffffffff>] 0xffffffffffffffff

stuck in the ptrace code.
The other task is more interesting:

$ md5sum bug/*/8102/stack | cut -d" " -f1 | sort | uniq -c
    135 042e893c0e6657ed321ea9045e528f3e
      6 dc7e71ce73be2a5c73404b565926e709

All snapshots with 042e893c0e6657ed321ea9045e528f3e are in:
[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110ba83>] T.1149+0x5f3/0x600
[<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0
[<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810ee2a9>] handle_pte_fault+0x609/0x940
[<ffffffff810ee718>] handle_mm_fault+0x138/0x260
[<ffffffff810270bd>] do_page_fault+0x13d/0x460
[<ffffffff815b633f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

While the others do not show any stack:
cat 1360287257/8102/stack
[<ffffffffffffffff>] 0xffffffffffffffff

Which is quite interesting, because we are talking about snapshots starting at 1360287245 (which maps to 02:34:05), but kern2.log tells us that this process had been killed much earlier, at:
Feb 8 01:18:30 server01 kernel: [ 511.139921] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb 8 01:18:30 server01 kernel: [ 511.229755] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230339] [ 8113] 1293 8113 163756 59442 5 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230528] [ 8116] 1293 8116 170094 65675 2 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230726] [ 8119] 1293 8119 170094 65675 6 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230924] [ 8123] 1293 8123 169070 64612 7 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.231132] [ 8124] 1293 8124 170094 65675 5 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.231321] [ 8125] 1293 8125 170094 65673 1 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.231516] Memory cgroup out of memory: Kill process 8102 (apache2) score 1000 or sacrifice child

This would suggest that the task is hung and cannot be killed, but if we look at the following OOM in the same group (1293), it was _not_ present in the process list for that group:
Feb 8 01:18:33 server01 kernel: [ 514.789550] Task in /1293/uid killed as a result of limit of /1293
[...]
Feb 8 01:18:33 server01 kernel: [ 514.893198] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Feb 8 01:18:33 server01 kernel: [ 514.893594] [ 8113] 1293 8113 168212 64036 1 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.893786] [ 8116] 1293 8116 170258 65870 6 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.893976] [ 8119] 1293 8119 170258 65870 7 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894166] [ 8123] 1293 8123 170158 65824 6 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894356] [ 8124] 1293 8124 170258 65870 5 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894547] [ 8125] 1293 8125 170158 65824 1 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894749] [ 8149] 1293 8149 163989 59647 7 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.894944] Memory cgroup out of memory: Kill process 8113 (apache2) score 1000 or sacrifice child

This is all _before_ you started collecting stacks, and it also says that 8102 is gone. This all suggests that a) the stack unwinder which displays /proc/<pid>/stack is somehow confused and doesn't show the correct stack for this process, and b) the two processes cannot terminate due to some issue related to ptrace (stracing) the dying process. The above OOM list doesn't include any processes which have already released their memory, which would explain why you can still see them as members of the group (when looking into the cgroup/tasks file). My guess would be that there is a bug in ptrace which doesn't free a reference to the task, so it cannot go away although it has dropped all its resources already.

> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> OOM message in the log?

I am not sure what you mean here, but there are
$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
16
OOM killer events during the time you were gathering memcg-bug-4 data.

> Data in memcg-bug-4.tar.gz are only for 2
> minutes but I let it run for about 15-20 minutes, no single process
> killed by OOM.
I can see
$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
57
killed after 02:38:47, when you stopped gathering data for memcg-bug-4.

> I'm 100% sure that OOM was not killing them (maybe it was trying to
> but it didn't happen).

OK, let's do a little exercise. The list of processes eligible for OOM is listed before any task is killed. So if we collect both the pid lists and the "Kill process" messages per pid, then no entries in the pid list should be present after the specific pid is killed.

$ mkdir out
$ for i in `grep "Memory cgroup out of memory: Kill process" kern2.log | sed 's@.*Kill process \([0-9]*\) .*@\1@'`
do
	grep -e "Memory cgroup out of memory: Kill process $i" \
	     -e "\[ *\<$i\]" kern2.log > out/$i
done
$ for i in out/*
do
	tail -n1 $i | grep "Memory cgroup out of memory:" >/dev/null || echo "$i has already killed tasks"
done
out/6698 has already killed tasks
out/6703 has already killed tasks

OK, so there are two pids which were listed after they had been killed. Let's have a look at them.
$ cat out/6698
Feb 8 01:17:04 server01 kernel: [ 425.497924] [ 6698] 1293 6698 170258 65846 1 0 0 apache2
Feb 8 01:17:05 server01 kernel: [ 426.079010] [ 6698] 1293 6698 170258 65846 1 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.144460] [ 6698] 1293 6698 169358 65220 1 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.146058] Memory cgroup out of memory: Kill process 6698 (apache2) score 1000 or sacrifice child
Feb 8 03:27:57 server01 kernel: [ 8278.439896] [ 6698] 1020 6698 168518 64219 0 0 0 apache2
Feb 8 03:27:57 server01 kernel: [ 8278.879439] [ 6698] 1020 6698 168518 64218 6 0 0 apache2
Feb 8 03:27:59 server01 kernel: [ 8280.023944] [ 6698] 1020 6698 168816 64540 7 0 0 apache2
Feb 8 03:28:02 server01 kernel: [ 8283.242282] [ 6698] 1020 6698 171953 67751 6 0 0 apache2
$ cat out/6703
Feb 8 01:17:04 server01 kernel: [ 425.498118] [ 6703] 1293 6703 170258 65844 6 0 0 apache2
Feb 8 01:17:05 server01 kernel: [ 426.079206] [ 6703] 1293 6703 170258 65844 6 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.144653] [ 6703] 1293 6703 169358 65219 2 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.258924] [ 6703] 1293 6703 169358 65219 5 0 0 apache2
Feb 8 01:17:10 server01 kernel: [ 431.260282] Memory cgroup out of memory: Kill process 6703 (apache2) score 1000 or sacrifice child
Feb 8 03:27:57 server01 kernel: [ 8278.440043] [ 6703] 1020 6703 166286 61978 7 0 0 apache2
Feb 8 03:27:57 server01 kernel: [ 8278.879587] [ 6703] 1020 6703 166286 61977 7 0 0 apache2
Feb 8 03:27:59 server01 kernel: [ 8280.024091] [ 6703] 1020 6703 166484 62233 7 0 0 apache2
Feb 8 03:28:02 server01 kernel: [ 8283.242429] [ 6703] 1020 6703 167402 63118 0 0 0 apache2

The lists have the following columns:
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name

As we can see, the uid changed for both pids after they were killed (from 1293 to 1020), which suggests that the pids were later reused for a different user (a clear sign that those pids died) - and thus a different group in your
setup. So those two died as well, apparently.

> >Nothing shows it would be a deadlock so far. It is well possible that
> >the userspace went mad when seeing a lot of processes dying because it
> >doesn't expect it.
>
> Lots of processes are dying also now, without your latest patch, and
> no such things are happening. I'm sure there is something more to
> this, maybe it revealed another bug?

So far nothing shows that there would be anything broken wrt. the memcg OOM killer. The ptrace issue sounds strange, all right, but that is another story and worth a separate investigation. I would be interested whether you still see anything wrong going on without that in play. You can get a pretty nice overview of what is going on wrt. OOM from the log.
-- Michal Hocko SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 444+ messages in thread
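The uid-column check described above can be scripted. The sketch below is illustrative only (it is not from the thread): the sample lines are modelled on the out/6698 dump quoted above, and the awk field positions assume that exact column layout.

```shell
# Sketch (not from the thread): flag pid reuse by checking whether the
# uid column of the OOM task-dump lines for one pid ever changes.
# Assumed layout after stripping brackets:
#   timestamp pid uid tgid total_vm rss cpu oom_adj oom_score_adj name
log=$(cat <<'EOF'
[  431.144460] [ 6698]  1293  6698  169358  65220  1  0  0 apache2
[ 8278.439896] [ 6698]  1020  6698  168518  64219  0  0  0 apache2
EOF
)
# Strip '[' and ']', then the uid is the 3rd field.
uids=$(printf '%s\n' "$log" | awk '{gsub(/[][]/, ""); print $3}' | sort -u)
if [ "$(printf '%s\n' "$uids" | wc -l)" -gt 1 ]; then
    echo "pid reused: uids" $uids
fi
```

On the two sample lines this reports both uids for pid 6698, mirroring the 1293 to 1020 change seen in the real log.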
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-10 15:03 ` Michal Hocko (?) @ 2013-02-10 16:46 ` azurIt -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2013-02-10 16:46 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

>stuck in the ptrace code.

But this happens _after_ the cgroup was frozen and I tried to strace one of its processes (to see what's happening):

Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

>> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
>> OOM message in the log?
>
>I am not sure what you mean here but there are
>$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
>16
>
>OOM killer events during the time you were gathering memcg-bug-4 data.
>
>> Data in memcg-bug-4.tar.gz are only for 2
>> minutes but I let it run for about 15-20 minutes, no single process
>> killed by OOM.
>
>I can see
>$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
>57
>
>killed after 02:38:47 when you stopped gathering data for memcg-bug-4

I meant that no single process was killed inside cgroup 1258 (data from this cgroup are in memcg-bug-4.tar.gz).

Just look at the data from memcg-bug-4.tar.gz, which were taken from cgroup 1258. Almost all processes are in 'mem_cgroup_handle_oom', so the cgroup is under OOM. I assume that this is supposed to take only a few seconds while the kernel finds a process and kills it (and maybe does it again until enough memory is freed). I was gathering the data for about two and a half minutes and NO SINGLE process was killed (just compare the list of PIDs from the first and the last directory inside memcg-bug-4.tar.gz).
Even more, no single process was killed in cgroup 1258 after I stopped gathering the data either. You can also take the list of PIDs from memcg-bug-4.tar.gz and you will find only 18211 and 8102 (which are the two stuck processes).

So my question is: why was no process killed inside cgroup 1258 while it was under OOM? It was under OOM for at least two and a half minutes while I was gathering the data (then I let it run for roughly 10 more minutes and then killed the processes by hand, but I cannot prove this). Why didn't the kernel kill any process for so long and end the OOM?

Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping between these two stacks (I pasted only the first line of each stack):
mem_cgroup_handle_oom+0x241/0x3b0
0xffffffffffffffff

Some of them are in 'poll_schedule_timeout' and then they start to loop as above. Is this correct behavior? For example, run (first line of the stack of process 7710 from all timestamps):
for i in */7710/stack; do head -n1 $i; done
^ permalink raw reply [flat|nested] 444+ messages in thread
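The per-snapshot stack dumps described above can be summarised mechanically. The sketch below is my own construction, not from the thread: it counts, per timestamp directory, how many tasks have mem_cgroup_handle_oom on top of their stack, assuming the <timestamp>/<pid>/stack layout of memcg-bug-4.tar.gz and using a tiny fake tree in place of the real data.

```shell
# Sketch (not from the thread): per snapshot directory, count tasks
# whose first stack frame is mem_cgroup_handle_oom. The layout
# <timestamp>/<pid>/stack is assumed to match memcg-bug-4.tar.gz;
# a small fake tree stands in for the real archive.
tmp=$(mktemp -d)
mkdir -p "$tmp/1360287245/7710" "$tmp/1360287245/8102"
echo '[<ffffffff81125eb9>] poll_schedule_timeout+0x49/0x70' \
    > "$tmp/1360287245/7710/stack"
echo '[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0' \
    > "$tmp/1360287245/8102/stack"
for ts in "$tmp"/*; do
    # head -qn1: first line of every stack file, without per-file headers
    n=$(head -qn1 "$ts"/*/stack | grep -c mem_cgroup_handle_oom)
    set -- "$ts"/*/stack
    echo "$(basename "$ts"): $n of $# tasks in mem_cgroup_handle_oom"
done
rm -r "$tmp"
```

Run against the real archive, a snapshot where the count equals the task count for minutes on end is exactly the "whole group stuck in mem_cgroup_handle_oom" condition described above.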
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-10 16:46 ` azurIt (?) @ 2013-02-11 11:22 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2013-02-11 11:22 UTC (permalink / raw)
To: azurIt
Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner

On Sun 10-02-13 17:46:19, azurIt wrote:
> >stuck in the ptrace code.
>
> But this happens _after_ the cgroup was frozen and I tried to strace
> one of its processes (to see what's happening):
>
> Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

Hmmm,
Feb 8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child

So the process had been killed 10 minutes earlier, and this was really the last OOM event for group /1258:

$ grep "Task in /1258/uid killed" kern2.log | tail -n2
Feb 8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258
Feb 8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258

But this was still before you started collecting data for memcg-bug-4 (2:34), so unfortunately we do not know what the previous stack was.

> >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> >> OOM message in the log?
> >
> >I am not sure what you mean here but there are
> >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
> >16
> >
> >OOM killer events during the time you were gathering memcg-bug-4 data.
> >
> >> Data in memcg-bug-4.tar.gz are only for 2
> >> minutes but I let it run for about 15-20 minutes, no single process
> >> killed by OOM.
> >
> >I can see
> >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
> >57
> >
> >killed after 02:38:47 when you stopped gathering data for memcg-bug-4
>
> I meant no single process was killed inside cgroup 1258 (data from
> this cgroup are in memcg-bug-4.tar.gz).
>
> Just get data from memcg-bug-4.tar.gz which were taken from cgroup
> 1258.

Are you sure about that? When I extracted all pids from the timestamp directories and grepped them in the log, I got this:

for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done
Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:31:15 server01 kernel: [ 1276.655241] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:29 server01 kernel: [ 1350.797835] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:42 server01 kernel: [ 1363.662242] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.181798] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.381627] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.490896] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:33:02 server01 kernel: [ 1383.709652] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.458967] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.558419] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.652474] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:02 server01 kernel: [ 1743.107086] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.015359] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.133998] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.262992] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:18:12 server01 kernel: [ 493.156641] [ 7888] 1293
7888 169326 64876 3 0 0 apache2
Feb 8 01:18:12 server01 kernel: [ 493.269129] [ 7888] 1293 7888 169390 64876 4 0 0 apache2
Feb 8 01:18:21 server01 kernel: [ 502.384221] [ 8011] 1293 8011 170094 65675 5 0 0 apache2
Feb 8 01:18:24 server01 kernel: [ 505.052600] [ 8011] 1293 8011 170260 65854 2 0 0 apache2
Feb 8 01:18:24 server01 kernel: [ 505.200454] [ 8011] 1293 8011 170260 65854 2 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.538637] [ 8054] 1258 8054 164404 60618 1 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2

So at least 7888, 8011 and 8102 were from a different group (1293). The others were never listed in the eligible-process lists, which is a bit unexpected. It is also unfortunate, because I cannot match them to their groups from the log.

$ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done
7265 not listed
7474 not listed
7710 not listed
7969 not listed
7988 not listed
7997 not listed
8000 not listed
8014 not listed
8016 not listed
8019 not listed
8057 not listed
8058 not listed
8059 not listed
8063 not listed
8064 not listed
8066 not listed
8067 not listed
8069 not listed
8070 not listed
8071 not listed
8072 not listed
8075 not listed
8091 not listed
8092 not listed
8094 not listed
8098 not listed
8099 not listed
8100 not listed

Are you sure all of them belong to the 1258 group?

> Almost all processes are in 'mem_cgroup_handle_oom' so cgroup
> is under OOM.

You are right, almost all of them are waiting in mem_cgroup_handle_oom, which suggests that they should be listed in the per-group eligible-task list. One way this might happen is if the process which manages to take the oom_lock has a fatal signal pending. Then we wouldn't get to oom_kill_process and no OOM messages would get printed. This is correct, because such a task would terminate soon anyway and all the waiters would wake up eventually.
If not enough memory was freed, another task would take the oom_lock and this one would trigger the OOM (unless it had a fatal signal pending as well).

Another option would be that no task could be selected - e.g. because select_bad_process sees a TIF_MEMDIE-marked task - one already killed by the OOM killer but that wasn't able to terminate for some reason. 18211 could be such a task. But we do not know what was going on with it before strace attached to it.

Finally, it is possible that the OOM header (everything up to "Kill process") was suppressed because of rate limiting. But

$ grep -B1 "Kill process" kern2.log
Feb 8 01:15:02 server01 kernel: [ 304.000402] [ 4969] 1258 4969 163761 59554 6 0 0 apache2
Feb 8 01:15:02 server01 kernel: [ 304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child
--
Feb 8 01:15:51 server01 kernel: [ 352.924573] [ 5847] 1709 5847 163433 58952 6 0 0 apache2
Feb 8 01:15:51 server01 kernel: [ 352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child
[...]

says that the message was preceded by a process list, so we can exclude rate limiting.

> I assume that this is supposed to take only a few seconds
> while the kernel finds a process and kills it (and maybe does it again
> until enough memory is freed). I was gathering the data for
> about two and a half minutes and NO SINGLE process was killed (just
> compare the list of PIDs from the first and the last directory inside
> memcg-bug-4.tar.gz). Even more, no single process was killed in cgroup
> 1258 after I stopped gathering the data either. You can also take the
> list of PIDs from memcg-bug-4.tar.gz and you will find only 18211 and
> 8102 (which are the two stuck processes).
>
> So my question is: why was no process killed inside cgroup 1258
> while it was under OOM?

I would bet that there is something weird going on with pid:18211. But I do not have enough information to find out what and why.
> It was under OOM for at least two and a half minutes while I was
> gathering the data (then I let it run for roughly 10 more minutes
> and then killed the processes by hand, but I cannot prove this). Why
> didn't the kernel kill any process for so long and end the OOM?

As already mentioned above, select_bad_process doesn't select any task if there is one which is on its way out. Maybe this is what is going on here.

> Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping between
> these two stacks (I pasted only the first line of each stack):
> mem_cgroup_handle_oom+0x241/0x3b0
> 0xffffffffffffffff

0xffffffffffffffff is just a bogus entry. No idea why this happens.

> Some of them are in 'poll_schedule_timeout' and then they start to
> loop as above. Is this correct behavior?
> For example, run (first line of the stack of process 7710 from all
> timestamps): for i in */7710/stack; do head -n1 $i; done

Yes, this is perfectly OK, because that task starts with:

$ cat bug/1360287245/7710/stack
[<ffffffff81125eb9>] poll_schedule_timeout+0x49/0x70
[<ffffffff8112675b>] do_sys_poll+0x54b/0x680
[<ffffffff81126b4c>] sys_poll+0x7c/0xf0
[<ffffffff815b6866>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

and then later on it gets into OOM because of a page fault:

$ cat bug/1360287250/7710/stack
[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110ba83>] T.1149+0x5f3/0x600
[<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0
[<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810eca1e>] do_wp_page+0x14e/0x800
[<ffffffff810edf04>] handle_pte_fault+0x264/0x940
[<ffffffff810ee718>] handle_mm_fault+0x138/0x260
[<ffffffff810270bd>] do_page_fault+0x13d/0x460
[<ffffffff815b633f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

And it loops there until the end, which is also possible if the group is under a permanent OOM condition and the task is never selected to be killed.
Unfortunately I am not able to reproduce this behavior even if I try to hammer OOM like mad, so I am afraid I cannot help you much without further debugging patches. I do realize that experimenting in your environment is a problem, but I do not have many options left. Please do not use strace and rather collect /proc/pid/stack instead. It would also be helpful to get the group/tasks file, to have a full list of tasks in the group.
---
>From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 11 Feb 2013 12:18:36 +0100
Subject: [PATCH] oom: debug skipping killing
---
 mm/oom_kill.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3d759f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
 			if (unlikely(frozen(p)))
 				thaw_process(p);
+			printk(KERN_WARNING"XXX: pid:%d (flags:%u) is TIF_MEMDIE. Waiting for it\n",
+					p->pid, p->flags);
 			return ERR_PTR(-1UL);
 		}
 		if (!p->mm)
@@ -353,8 +355,11 @@
 			 * then wait for it to finish before killing
 			 * some other task unnecessarily.
 			 */
-			if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
+			if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) {
+				printk(KERN_WARNING"XXX: pid:%d (flags:%u) is PF_EXITING. Waiting for it\n",
+						p->pid, p->flags);
 				return ERR_PTR(-1UL);
+			}
 		}
 	}
@@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
+		printk(KERN_WARNING"XXX: pid:%d (flags:%u). Not killing PF_EXITING\n", p->pid, p->flags);
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		return 0;
 	}
@@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 * its memory.
*/ if (fatal_signal_pending(current)) { + printk(KERN_WARNING"XXX: pid:%d (flags:%u) has fatal_signal_pending. Waiting for it\n", + p->pid, p->flags); set_thread_flag(TIF_MEMDIE); return; } -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
@ 2013-02-11 11:22 ` Michal Hocko
  0 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2013-02-11 11:22 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
      Johannes Weiner

On Sun 10-02-13 17:46:19, azurIt wrote:
> >stuck in the ptrace code.
>
> But this happens _after_ the cgroup was frozen and i tried to strace
> one of its processes (to see what's happening):
>
> Feb 8 01:29:46 server01 kernel: [ 1187.540672] grsec: From 178.40.250.111: process /usr/lib/apache2/mpm-itk/apache2(apache2:18211) attached to via ptrace by /usr/bin/strace[strace:18258] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/htop[htop:2901] uid/euid:0/0 gid/egid:0/0

Hmmm,

Feb 8 01:39:16 server01 kernel: [ 1757.266678] Memory cgroup out of memory: Kill process 18211 (apache2) score 725 or sacrifice child

So the process was killed 10 minutes earlier, and this was really the
last OOM event for group /1258:

$ grep "Task in /1258/uid killed" kern2.log | tail -n2
Feb 8 01:39:16 server01 kernel: [ 1757.045021] Task in /1258/uid killed as a result of limit of /1258
Feb 8 01:39:16 server01 kernel: [ 1757.167984] Task in /1258/uid killed as a result of limit of /1258

But this was still before you started collecting data for memcg-bug-4
(2:34), so unfortunately we do not know what its previous stack was.

> >> Why are all PIDs waiting on 'mem_cgroup_handle_oom' and there is no
> >> OOM message in the log?
> >
> >I am not sure what you mean here but there are
> >$ grep "Memory cgroup out of memory:" kern2.collected.log | wc -l
> >16
> >
> >OOM killer events during the time you were gathering memcg-bug-4 data.
> >
> >> Data in memcg-bug-4.tar.gz are only for 2
> >> minutes but i let it run for about 15-20 minutes, no single process
> >> killed by OOM.
> >
> >I can see
> >$ grep "Memory cgroup out of memory:" kern2.after.log | wc -l
> >57
> >
> >killed after 02:38:47 when you stopped gathering data for memcg-bug-4
>
> I meant no single process was killed inside cgroup 1258 (data from
> this cgroup are in memcg-bug-4.tar.gz).
>
> Just get data from memcg-bug-4.tar.gz which were taken from cgroup
> 1258.

Are you sure about that? When I extracted all pids from the timestamp
directories and grepped them in the log I got this:

for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log ; done
Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:31:15 server01 kernel: [ 1276.655241] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:29 server01 kernel: [ 1350.797835] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:42 server01 kernel: [ 1363.662242] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.181798] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.381627] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:32:46 server01 kernel: [ 1367.490896] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:33:02 server01 kernel: [ 1383.709652] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.458967] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.558419] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:36:26 server01 kernel: [ 1587.652474] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:02 server01 kernel: [ 1743.107086] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.015359] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.133998] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:39:16 server01 kernel: [ 1757.262992] [18211] 1258 18211 164338 60950 0 0 0 apache2
Feb 8 01:18:12 server01 kernel: [ 493.156641] [ 7888] 1293 7888 169326 64876 3 0 0 apache2
Feb 8 01:18:12 server01 kernel: [ 493.269129] [ 7888] 1293 7888 169390 64876 4 0 0 apache2
Feb 8 01:18:21 server01 kernel: [ 502.384221] [ 8011] 1293 8011 170094 65675 5 0 0 apache2
Feb 8 01:18:24 server01 kernel: [ 505.052600] [ 8011] 1293 8011 170260 65854 2 0 0 apache2
Feb 8 01:18:24 server01 kernel: [ 505.200454] [ 8011] 1293 8011 170260 65854 2 0 0 apache2
Feb 8 01:18:33 server01 kernel: [ 514.538637] [ 8054] 1258 8054 164404 60618 1 0 0 apache2
Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2

So at least 7888, 8011 and 8102 were from a different group (1293).
The others were never listed in the eligible processes list, which is a
bit unexpected. It is also unfortunate because I cannot match them to
their groups from the log.

$ for i in `cat bug/pids` ; do grep "\[ *\<$i\>\]" kern2.log >/dev/null || echo "$i not listed" ; done
7265 not listed
7474 not listed
7710 not listed
7969 not listed
7988 not listed
7997 not listed
8000 not listed
8014 not listed
8016 not listed
8019 not listed
8057 not listed
8058 not listed
8059 not listed
8063 not listed
8064 not listed
8066 not listed
8067 not listed
8069 not listed
8070 not listed
8071 not listed
8072 not listed
8075 not listed
8091 not listed
8092 not listed
8094 not listed
8098 not listed
8099 not listed
8100 not listed

Are you sure all of them belong to the 1258 group?

> Almost all processes are in 'mem_cgroup_handle_oom' so cgroup
> is under OOM.

You are right, almost all of them are waiting in mem_cgroup_handle_oom,
which suggests that they should be listed in the per-group eligible
tasks list. One way this might happen is when the process which manages
to take the oom_lock has a fatal signal pending. Then we wouldn't get
to oom_kill_process and no OOM messages would get printed. This is
correct because such a task would terminate soon anyway and all the
waiters would wake up eventually.
If not enough memory was freed, another task would take the oom_lock
and that one would trigger the OOM killer (unless it had a fatal signal
pending as well).

Another option would be that no task could be selected - e.g. because
select_bad_process sees a TIF_MEMDIE-marked task - one already killed
by the OOM killer but which wasn't able to terminate for some reason.
18211 could be such a task. But we do not know what was going on with
it before strace attached to it.

Finally, it is possible that the OOM header (everything up to "Kill
process") was suppressed because of rate limiting. But

$ grep -B1 "Kill process" kern2.log
Feb 8 01:15:02 server01 kernel: [ 304.000402] [ 4969] 1258 4969 163761 59554 6 0 0 apache2
Feb 8 01:15:02 server01 kernel: [ 304.000649] Memory cgroup out of memory: Kill process 4816 (apache2) score 1000 or sacrifice child
--
Feb 8 01:15:51 server01 kernel: [ 352.924573] [ 5847] 1709 5847 163433 58952 6 0 0 apache2
Feb 8 01:15:51 server01 kernel: [ 352.924761] Memory cgroup out of memory: Kill process 5212 (apache2) score 1000 or sacrifice child
[...]

says that the message was preceded by a process list, so we can exclude
rate limiting.

> I assume that this is supposed to take only a few seconds
> while the kernel finds any process and kills it (and maybe does it
> again until enough memory is freed). I was gathering the data for
> about 2 and a half minutes and NO SINGLE process was killed (just
> compare the list of PIDs from the first and the last directory inside
> memcg-bug-4.tar.gz). Even more, no single process was killed in cgroup
> 1258 also after i stopped gathering the data. You can also take the
> list of PIDs from memcg-bug-4.tar.gz and you will find only 18211 and
> 8102 (which are the two stuck processes).
>
> So my question is: Why was no process killed inside cgroup 1258
> while it was under OOM?

I would bet that there is something weird going on with pid:18211. But
I do not have enough information to find out what and why.
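The pid-to-group matching done by hand above can be scripted. A minimal sketch, assuming the OOM task-dump line format quoted above (the first bracketed number after the timestamp is the pid, the next field the uid, which doubles as the cgroup name in this setup); `pid_groups` is a hypothetical helper name:

```shell
# pid_groups: read kernel-log lines on stdin and print deduplicated
# "PID GROUP" pairs for every OOM task-dump entry of the form
#   ... kernel: [ 1263.429212] [18211] 1258 18211 164338 ...
# Lines without a bracketed pid after the timestamp are ignored.
pid_groups() {
    sed -n 's/.*kernel: \[ *[0-9.]*\] *\[ *\([0-9]*\)\] *\([0-9]*\).*/\1 \2/p' | sort -u
}

# Example with two of the lines quoted above:
printf '%s\n' \
  'Feb 8 01:31:02 server01 kernel: [ 1263.429212] [18211] 1258 18211 164338 60950 0 0 0 apache2' \
  'Feb 8 01:18:30 server01 kernel: [ 511.230146] [ 8102] 1293 8102 170258 65869 7 0 0 apache2' \
  | pid_groups
# prints:
# 18211 1258
# 8102 1293
```

Running it as `pid_groups < kern2.log` would answer the "which group does each pid belong to" question in one pass, for every pid that ever appeared in a task dump.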
> It was under OOM for at least 2 and a half minutes while i was
> gathering the data (then i let it run for an additional, cca, 10
> minutes and then killed the processes by hand, but i cannot prove
> this). Why didn't the kernel kill any process for so long and end the
> OOM?

As already mentioned above, select_bad_process doesn't select any task
if there is one which is on the way out. Maybe this is what is going on
here.

> Btw, processes in cgroup 1258 (memcg-bug-4.tar.gz) are looping in
> these two states (i pasted only the first line of stack):
> mem_cgroup_handle_oom+0x241/0x3b0
> 0xffffffffffffffff

0xffffffffffffffff is just a bogus entry. No idea why this happens.

> Some of them are in 'poll_schedule_timeout' and then they start to
> loop as above. Is this correct behavior?
> For example, do (first line of stack from process 7710 from all
> timestamps): for i in */7710/stack; do head -n1 $i; done

Yes, this is perfectly OK, because that task starts with:

$ cat bug/1360287245/7710/stack
[<ffffffff81125eb9>] poll_schedule_timeout+0x49/0x70
[<ffffffff8112675b>] do_sys_poll+0x54b/0x680
[<ffffffff81126b4c>] sys_poll+0x7c/0xf0
[<ffffffff815b6866>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

and then later on it gets into OOM because of a page fault:

$ cat bug/1360287250/7710/stack
[<ffffffff8110ae51>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110ba83>] T.1149+0x5f3/0x600
[<ffffffff8110bf5c>] mem_cgroup_charge_common+0x6c/0xb0
[<ffffffff8110bfe5>] mem_cgroup_newpage_charge+0x45/0x50
[<ffffffff810eca1e>] do_wp_page+0x14e/0x800
[<ffffffff810edf04>] handle_pte_fault+0x264/0x940
[<ffffffff810ee718>] handle_mm_fault+0x138/0x260
[<ffffffff810270bd>] do_page_fault+0x13d/0x460
[<ffffffff815b633f>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

And it loops in it until the end, which is also possible if the group
is under a permanent OOM condition and the task is not selected to be
killed.
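The per-task check run by hand above (`for i in */7710/stack; do head -n1 $i; done`) generalizes to a small helper; a sketch assuming the snapshot layout from the memcg-bug tarballs (`<timestamp>/<pid>/stack`), with `top_frames` a made-up name:

```shell
# top_frames PID DIR: print the first stack frame of PID from every
# snapshot directory under DIR (layout DIR/<timestamp>/<pid>/stack).
# A task stuck in mem_cgroup_handle_oom in every snapshot then shows
# up as a column of identical frames.
top_frames() {
    pid=$1; dir=$2
    for f in "$dir"/*/"$pid"/stack; do
        [ -r "$f" ] && head -n1 "$f"
    done
    return 0
}
```

Usage against an unpacked tarball would be e.g. `top_frames 7710 bug`; the shell glob sorts the timestamp directories, so the frames come out in chronological order.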
Unfortunately I am not able to reproduce this behavior even if I try to
hammer OOM like mad, so I am afraid I cannot help you much without
further debugging patches. I do realize that experimenting in your
environment is a problem, but I do not have many options left. Please
do not use strace and rather collect /proc/<pid>/stack instead. It
would also be helpful to get the group's tasks file to have a full list
of tasks in the group.
---
From 1139745d43cc8c56bc79c219291d1e5281799dd4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 11 Feb 2013 12:18:36 +0100
Subject: [PATCH] oom: debug skipping killing

---
 mm/oom_kill.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..3d759f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -329,6 +329,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
 			if (unlikely(frozen(p)))
 				thaw_process(p);
+			printk(KERN_WARNING"XXX: pid:%d (flags:%u) is TIF_MEMDIE. Waiting for it\n",
+					p->pid, p->flags);
 			return ERR_PTR(-1UL);
 		}
 		if (!p->mm)
@@ -353,8 +355,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			 * then wait for it to finish before killing
 			 * some other task unnecessarily.
 			 */
-			if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
+			if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) {
+				printk(KERN_WARNING"XXX: pid:%d (flags:%u) is PF_EXITING. Waiting for it\n",
+						p->pid, p->flags);
 				return ERR_PTR(-1UL);
+			}
 		}
 	}
@@ -494,6 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
+		printk(KERN_WARNING"XXX: pid:%d (flags:%u). Not killing PF_EXITING\n", p->pid, p->flags);
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		return 0;
 	}
@@ -567,6 +573,8 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	 * its memory.
 	 */
 	if (fatal_signal_pending(current)) {
+		printk(KERN_WARNING"XXX: pid:%d (flags:%u) has fatal_signal_pending. Waiting for it\n",
+				p->pid, p->flags);
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 444+ messages in thread
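The data collection requested above (the group's tasks file plus each task's /proc/<pid>/stack, instead of strace) can be scripted along these lines. This is a sketch only: the cgroup mount point in the usage comment is an assumption, and `collect_once` is a hypothetical helper name:

```shell
# collect_once TASKS PROC OUT: copy the cgroup tasks file and snapshot
# PROC/<pid>/stack for every pid listed in TASKS into the directory OUT
# (one subdirectory per pid, mirroring the memcg-bug tarball layout).
# Tasks that exit between the read and the snapshot are skipped.
collect_once() {
    tasks=$1; proc=$2; out=$3
    mkdir -p "$out"
    cp "$tasks" "$out/tasks"
    while read -r pid; do
        mkdir -p "$out/$pid"
        cat "$proc/$pid/stack" > "$out/$pid/stack" 2>/dev/null || true
    done < "$tasks"
}

# e.g. one snapshot per second for group 1258 (cgroup path assumed):
#   while sleep 1; do
#       collect_once /sys/fs/cgroup/memory/1258/tasks /proc "bug/$(date +%s)"
#   done
```

Reading /proc/<pid>/stack needs root, but unlike ptrace-based tools it does not touch the target task's state, which matters here given that attaching strace appeared to interact badly with the stuck task.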
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-11 11:22 ` Michal Hocko
  (?)
@ 2013-02-22  8:23 ` azurIt
  -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2013-02-22 8:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
      Johannes Weiner

>Unfortunately I am not able to reproduce this behavior even if I try
>to hammer OOM like mad so I am afraid I cannot help you much without
>further debugging patches.
>I do realize that experimenting in your environment is a problem but I
>do not have many options left. Please do not use strace and rather
>collect /proc/pid/stack instead. It would also be helpful to get the
>group/tasks file to have a full list of tasks in the group

Hi Michal,

sorry that i didn't respond for a while. Today i installed a kernel
with your two patches and i'm running it now. I'm still having problems
with OOM, which is not able to handle low memory and is not killing
processes. Here is some info:

- data from cgroup 1258 while it was under OOM and no processes were
  killed (so the OOM didn't stop and the cgroup was frozen):
  http://watchdog.sk/lkml/memcg-bug-6.tar.gz

  I noticed the problem at about 8:39 and waited until 8:57 (nothing
  happened). Then i killed process 19864, which seemed to help: the
  other processes probably finished and the cgroup started to work. But
  the problem occurred again about 20 seconds later, so i killed all
  processes at 8:58. The problem has been occurring all the time since
  then. All processes (in that cgroup) are always in state 'D' when it
  occurs.

- kernel log from boot until now:
  http://watchdog.sk/lkml/kern3.gz

Btw, something probably happened also at about 3:09 but i wasn't able
to gather any data because my 'load check script' killed all apache
processes (load was more than 100).

azur

^ permalink raw reply	[flat|nested] 444+ messages in thread
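Since the stuck tasks always show up in state 'D' (uninterruptible sleep), they can be spotted from a plain `ps` listing without attaching anything to them; a sketch, where `filter_d` is a hypothetical helper:

```shell
# filter_d: keep only 'D'-state tasks from a "ps -eo pid,stat,wchan,comm"
# listing read on stdin.  The header line is dropped and the state is
# taken from the second column; matching lines pass through unchanged.
filter_d() { awk 'NR > 1 && $2 ~ /^D/'; }

# Live usage (wchan shows where each task is sleeping in the kernel):
#   ps -eo pid,stat,wchan,comm | filter_d
```

On a box hitting this bug, the wchan column would be expected to point at the memcg OOM wait path for the affected cgroup's tasks, which is a quicker triage step than snapshotting every stack.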
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-22  8:23 ` azurIt
  (?)
@ 2013-02-22 12:52 ` Michal Hocko
  -1 siblings, 0 replies; 444+ messages in thread
From: Michal Hocko @ 2013-02-22 12:52 UTC (permalink / raw)
  To: azurIt
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
      Johannes Weiner

Hi,

On Fri 22-02-13 09:23:32, azurIt wrote:
[...]
> sorry that i didn't respond for a while. Today i installed a kernel
> with your two patches and i'm running it now.

I am not sure how much time I'll have for this today but just to make
sure we are on the same page, could you point me to the two patches you
have applied in the mean time?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
  2013-02-22 12:52 ` Michal Hocko
@ 2013-02-22 12:54 ` azurIt
  -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2013-02-22 12:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
      Johannes Weiner

>I am not sure how much time I'll have for this today but just to make
>sure we are on the same page, could you point me to the two patches you
>have applied in the mean time?

Here:
http://watchdog.sk/lkml/patches2

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-22 12:54 ` azurIt (?) @ 2013-02-22 13:00 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-22 13:00 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Fri 22-02-13 13:54:42, azurIt wrote: > >I am not sure how much time I'll have for this today but just to make > >sure we are on the same page, could you point me to the two patches you > >have applied in the mean time? > > > Here: > http://watchdog.sk/lkml/patches2 OK, looks correct. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-22 8:23 ` azurIt (?) @ 2013-06-06 16:04 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-06-06 16:04 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Hi, I am really sorry it took so long but I was constantly preempted by other stuff. I hope I have good news for you, though. Johannes has found a nice way to overcome the deadlock issues in memcg OOM handling which might help you. Would you be willing to test his patch (http://permalink.gmane.org/gmane.linux.kernel.mm/101437)? Unlike my patch, which handles just the i_mutex case, his patch covers all possible locks. I can backport the patch to your kernel (are you still using the 3.2 kernel or have you moved to a newer one?). On Fri 22-02-13 09:23:32, azurIt wrote: > >Unfortunately I am not able to reproduce this behavior even if I try > >to hammer OOM like mad so I am afraid I cannot help you much without > >further debugging patches. > >I do realize that experimenting in your environment is a problem but I > >do not many options left. Please do not use strace and rather collect > >/proc/pid/stack instead. It would be also helpful to get group/tasks > >file to have a full list of tasks in the group > > > > Hi Michal, > > > sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: > > - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) > http://watchdog.sk/lkml/memcg-bug-6.tar.gz > > I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work.
But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. > > > - kernel log from boot until now > http://watchdog.sk/lkml/kern3.gz > > > Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). > > > > azur > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
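Michal's debugging suggestion in the message above — collect /proc/&lt;pid&gt;/stack for every task listed in the cgroup's tasks file instead of attaching strace — can be sketched as a small helper. The function below is an editor's illustration only (it is not from the thread), and the default tasks-file path is an assumption for a memcg v1 hierarchy:

```python
# Sketch: collect /proc/<pid>/stack for each task in a memcg's tasks file.
# Reading another task's stack usually requires root; failures are recorded
# as None instead of aborting the sweep.

def collect_stacks(tasks_file="/sys/fs/cgroup/memory/1258/tasks"):
    with open(tasks_file) as f:
        pids = [line.strip() for line in f if line.strip()]
    stacks = {}
    for pid in pids:
        try:
            with open("/proc/%s/stack" % pid) as s:
                stacks[pid] = s.read()
        except OSError:
            stacks[pid] = None  # task exited, or insufficient privileges
    return stacks
```

Run periodically while the group is wedged; a task stuck in memcg OOM handling shows up with the same mem_cgroup_handle_oom() stack as in the original report.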
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-06-06 16:04 ` Michal Hocko (?) @ 2013-06-06 16:16 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-06-06 16:16 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Hello Michal, nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and try to backport it? Thank you very much! azur ______________________________________________________________ > Od: "Michal Hocko" <mhocko@suse.cz> > Komu: azurIt <azurit@pobox.sk> > Dátum: 06.06.2013 18:04 > Predmet: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set > > CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, "Johannes Weiner" <hannes@cmpxchg.org> >Hi, > >I am really sorry it took so long but I was constantly preempted by >other stuff. I hope I have a good news for you, though. Johannes has >found a nice way how to overcome deadlock issues from memcg OOM which >might help you. Would you be willing to test with his patch >(http://permalink.gmane.org/gmane.linux.kernel.mm/101437). Unlike my >patch which handles just the i_mutex case his patch solved all possible >locks. > >I can backport the patch for your kernel (are you still using 3.2 kernel >or you have moved to a newer one?). > >On Fri 22-02-13 09:23:32, azurIt wrote: >> >Unfortunately I am not able to reproduce this behavior even if I try >> >to hammer OOM like mad so I am afraid I cannot help you much without >> >further debugging patches. >> >I do realize that experimenting in your environment is a problem but I >> >do not many options left. Please do not use strace and rather collect >> >/proc/pid/stack instead. 
It would be also helpful to get group/tasks >> >file to have a full list of tasks in the group >> >> >> >> Hi Michal, >> >> >> sorry that i didn't response for a while. Today i installed kernel with your two patches and i'm running it now. I'm still having problems with OOM which is not able to handle low memory and is not killing processes. Here is some info: >> >> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM don't stop and cgroup was freezed) >> http://watchdog.sk/lkml/memcg-bug-6.tar.gz >> >> I noticed problem about on 8:39 and waited until 8:57 (nothing happend). Then i killed process 19864 which seems to help and other processes probably ends and cgroup started to work. But problem accoured again about 20 seconds later, so i killed all processes at 8:58. The problem is occuring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs. >> >> >> - kernel log from boot until now >> http://watchdog.sk/lkml/kern3.gz >> >> >> Btw, something probably happened also at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100). >> >> >> >> azur >> -- >> To unsubscribe from this list: send the line "unsubscribe cgroups" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >-- >Michal Hocko >SUSE Labs > ^ permalink raw reply [flat|nested] 444+ messages in thread
* [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-06 16:16 ` azurIt (?) @ 2013-06-07 13:11 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-06-07 13:11 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Thu 06-06-13 18:16:33, azurIt wrote: > Hello Michal, > > nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and > try to backport it? Thank you very much! Here we go. I hope I didn't screw anything up (Johannes might double-check) because there have been quite a few changes in this area since 3.2. Nothing earth-shattering, though. Please note that I have only compile tested this. Also make sure you remove the previous patches you have from me. --- >From 9d2801c1f53147ca9134cc5f76ab28d505a37a54 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Fri, 7 Jun 2013 13:52:42 +0200 Subject: [PATCH] memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex.
The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff OOM kill victim: [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. 
For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting an OOM and makes sure nobody loops or sleeps on OOM with locks held: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue. Userspace should be able to handle this and it prevents anybody from looping or waiting with locks held. 2. When OOMing in a page fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 3. When detecting an OOM in a page fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. 
Reported-by: Reported-by: azurIt <azurit@pobox.sk> Debugged-by: Michal Hocko <mhocko@suse.cz> Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 22 +++++++ include/linux/mm.h | 1 + include/linux/sched.h | 6 ++ mm/ksm.c | 2 +- mm/memcontrol.c | 149 ++++++++++++++++++++++++++++---------------- mm/memory.c | 40 ++++++++---- mm/oom_kill.c | 2 + 7 files changed, 156 insertions(+), 66 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..56bfc39 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 1; +} +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 0; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +342,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ +} + +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..91380ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The 
fault task is in SIGKILL killable region */ +#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..d521a70 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..3295a3b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..67189b4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,55 +1859,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
* [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM @ 2013-06-07 13:11 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-06-07 13:11 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Thu 06-06-13 18:16:33, azurIt wrote: > Hello Michal, > > nice to read you! :) Yes, i'm still on 3.2. Could you be so kind and > try to backport it? Thank you very much! Here we go. I hope I didn't screw anything (Johannes might double check) because there were quite some changes in the area since 3.2. Nothing earth shattering though. Please note that I have only compile tested this. Also make sure you remove the previous patches you have from me. --- From 9d2801c1f53147ca9134cc5f76ab28d505a37a54 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Fri, 7 Jun 2013 13:52:42 +0200 Subject: [PATCH] memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. 
The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff OOM kill victim: [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. 
For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting an OOM and makes sure nobody loops or sleeps on OOM with locks held: 1. When OOMing in a system call (buffered IO and friends), invoke the OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue. Userspace should be able to handle this and it prevents anybody from looping or waiting with locks held. 2. When OOMing in a page fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 3. When detecting an OOM in a page fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. Only uncharges and limit increases, things that actually change the memory situation, should do wakeups. 
Reported-by: azurIt <azurit@pobox.sk> Debugged-by: Michal Hocko <mhocko@suse.cz> Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 22 +++++++ include/linux/mm.h | 1 + include/linux/sched.h | 6 ++ mm/ksm.c | 2 +- mm/memcontrol.c | 149 ++++++++++++++++++++++++++++---------------- mm/memory.c | 40 ++++++++---- mm/oom_kill.c | 2 + 7 files changed, 156 insertions(+), 66 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..56bfc39 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 1; +} +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ + p->memcg_oom.in_userfault = 0; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +342,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void mem_cgroup_set_userfault(struct task_struct *p) +{ +} + +static inline void mem_cgroup_clear_userfault(struct task_struct *p) +{ +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..91380ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The
fault task is in SIGKILL killable region */ +#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..d521a70 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int in_userfault:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..3295a3b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..67189b4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,55 +1859,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; - - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + bool locked, need_to_kill = true; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) mem_cgroup_oom_notify(memcg); spin_unlock(&memcg_oom_lock); - if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); - mem_cgroup_out_of_memory(memcg, mask); - } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this is a + * page fault and somebody else is handling the OOM already, + * we need to sleep on the OOM waitqueue for this memcg until + * the situation is resolved. Which can take some time + * because it might be handled by a userspace task. + * + * However, this is the charge context, which means that we + * may sit on a large call stack and hold various filesystem + * locks, the mmap_sem etc. and we don't want the OOM handler + * to deadlock on them while we sit here and wait. Store the + * current OOM context in the task_struct, then return + * -ENOMEM. At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check back + * with us by calling mem_cgroup_oom_synchronize(), possibly + * putting the task to sleep. 
+ */ + if (current->memcg_oom.in_userfault) { + current->memcg_oom.in_memcg_oom = 1; + /* + * Somebody else is handling the situation. Make sure + * no wakeups are missed between now and going to + * sleep at the end of the page fault. + */ + if (!need_to_kill) { + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = + atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; + } } - spin_lock(&memcg_oom_lock); - if (locked) + + if (need_to_kill) + mem_cgroup_out_of_memory(memcg, mask); + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2251,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2312,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2400,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2408,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2421,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..bee177c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1720,7 +1720,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, cond_resched(); while (!(page = follow_page(vma, start, foll_flags))) { int ret; - unsigned int fault_flags = 0; + unsigned int fault_flags = FAULT_FLAG_KERNEL; /* For mlock, just skip the stack guard page. */ if (foll_flags & FOLL_MLOCK) { @@ -1842,6 +1842,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, if (!vma || address < vma->vm_start) return -EFAULT; + fault_flags |= FAULT_FLAG_KERNEL; ret = handle_mm_fault(mm, vma, address, fault_flags); if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int in_userfault = !(flags & FAULT_FLAG_KERNEL); + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (in_userfault) + mem_cgroup_set_userfault(current); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (in_userfault) + mem_cgroup_clear_userfault(current); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.7.10.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-07 13:11 ` Michal Hocko @ 2013-06-17 10:21 ` azurIt 1 sibling, 0 replies; 444+ messages in thread From: azurIt @ 2013-06-17 10:21 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Here we go. I hope I didn't screw anything (Johannes might double check) >because there were quite some changes in the area since 3.2. Nothing >earth shattering though. Please note that I have only compile tested >this. Also make sure you remove the previous patches you have from me. Hi Michal, it, unfortunately, didn't work. Everything was working fine but the original problem is still occurring. I'm unable to send you stacks or more info because the problem has been taking down the whole server for some time now (I don't know what exactly caused it to start happening, maybe newer versions of 3.2.x). But I'm sure of one thing - when the problem occurs, nothing is able to access the hard drives (every process which tries is frozen until the problem is resolved or the server is rebooted). The problem is fixed after killing the processes from the cgroup which caused it, and everything immediately starts to work normally. I found this out by keeping a terminal open from another server to the one where my problem occurs quite often and running several apps there (htop, iotop, etc.). When the problem occurs, all the apps which weren't touching the HDD were ok. htop proved to be very useful here because it only reads the proc filesystem and is also able to send KILL signals - I was able to resolve the problem with it without rebooting the server. I created a special daemon (about a month ago) which is able to detect and fix the problem, so I'm not having server outages now. The point was to NOT access anything stored on the HDDs; the daemon only reads info from the cgroup filesystem and sends KILL signals to processes.
Maybe I should also read the stack files before killing; I will try it. Btw, which vanilla kernel includes this patch? Thank you and everyone involved very much for your time and help. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-17 10:21 ` azurIt @ 2013-06-19 13:26 ` Michal Hocko 1 sibling, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-06-19 13:26 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 17-06-13 12:21:34, azurIt wrote: > >Here we go. I hope I didn't screw anything (Johannes might double check) > >because there were quite some changes in the area since 3.2. Nothing > >earth shattering though. Please note that I have only compile tested > >this. Also make sure you remove the previous patches you have from me. > > > Hi Michal, > > it, unfortunately, didn't work. Everything was working fine but > original problem is still occuring. This would be more than surprising because tasks blocked at memcg OOM don't hold any locks anymore. Maybe I have messed something up during the backport but I cannot spot anything. > I'm unable to send you stacks or more info because problem is taking > down the whole server for some time now (don't know what exactly > caused it to start happening, maybe newer versions of 3.2.x). So you are not testing with the same kernel with just the old patch replaced by the new one? > But i'm sure of one thing - when problem occurs, nothing is able to > access hard drives (every process which tries it is freezed until > problem is resolved or server is rebooted). It would be really interesting to see what those tasks are blocked on. > Problem is fixed after killing processes from cgroup which > caused it and everything immediatelly starts to work normally. I > find this out by keeping terminal opened from another server to one > where my problem is occuring quite often and running several apps > there (htop, iotop, etc.). When problem occurs, all apps which wasn't > working with HDD was ok.
The htop proved to be very usefull here > because it's only reading proc filesystem and is also able to send > KILL signals - i was able to resolve the problem with it > without rebooting the server. sysrq+t will give you the list of all tasks and their traces. > I created a special daemon (about month ago) which is able to detect > and fix the problem so i'm not having server outages now. The point > was to NOT access anything which is stored on HDDs, the daemon is > only reading info from cgroup filesystem and sending KILL signals to > processes. Maybe i should be able to also read stack files before > killing, i will try it. > > Btw, which vanilla kernel includes this patch? None yet. But I hope it will be merged to 3.11 and backported to the stable trees. > Thank you and everyone involved very much for time and help. > > azur -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-19 13:26 ` Michal Hocko @ 2013-06-22 20:09 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-06-22 20:09 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner Michal, >> I'm unable to send you stacks or more info because problem is taking >> down the whole server for some time now (don't know what exactly >> caused it to start happening, maybe newer versions of 3.2.x). > >So you are not testing with the same kernel with just the old patch >replaced by the new one? No, i'm not testing with the same kernel but all are 3.2.x. I even cannot install older 3.2.x because grsecurity is always available for newest kernel and there is no archive of older versions (at least i don't know about any). >> But i'm sure of one thing - when problem occurs, nothing is able to >> access hard drives (every process which tries it is freezed until >> problem is resolved or server is rebooted). > >I would be really interesting to see what those tasks are blocked on. I'm trying to get it, stay tuned :) Today i noticed one bug, not 100% sure it is related to 'your' patch but i didn't seen this before. I noticed that i have lots of cgroups which cannot be removed - if i do 'rmdir <cgroup_directory>', it just hangs and never complete. Even more, it's not possible to access the whole cgroup filesystem until i kill that rmdir (anything, which tries it, just hangs). All unremoveable cgroups has this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 And, yes, 'tasks' file is empty. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
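The "unremovable cgroup" signature described above — `under_oom 1` in `memory.oom_control` while the `tasks` file is empty, so `rmdir` blocks on OOM references nobody will release — can be checked for mechanically. A rough sketch (cgroup v1 layout assumed; the function names are hypothetical):

```python
import os

def is_unremovable(oom_control_text, tasks_text):
    """The hang signature from the report above: the group is marked
    under_oom although no task is left in it, so rmdir would block
    forever waiting for outstanding OOM references."""
    under_oom = any(line.split() == ["under_oom", "1"]
                    for line in oom_control_text.splitlines())
    return under_oom and tasks_text.strip() == ""

def find_unremovable(root="/sys/fs/cgroup/memory"):  # assumed v1 mount point
    """Walk the memory-cgroup hierarchy and list directories matching the
    signature, reading only cgroupfs."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        if "memory.oom_control" not in files:
            continue
        with open(os.path.join(dirpath, "memory.oom_control")) as f:
            ctrl = f.read()
        with open(os.path.join(dirpath, "tasks")) as f:
            tasks = f.read()
        if is_unremovable(ctrl, tasks):
            hits.append(dirpath)
    return hits
```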
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-22 20:09 ` azurIt (?) @ 2013-06-24 20:13 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-06-24 20:13 UTC (permalink / raw) To: azurIt Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki Hi guys, On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > >> But i'm sure of one thing - when problem occurs, nothing is able to > >> access hard drives (every process which tries it is freezed until > >> problem is resolved or server is rebooted). > > > >I would be really interesting to see what those tasks are blocked on. > > I'm trying to get it, stay tuned :) > > Today i noticed one bug, not 100% sure it is related to 'your' patch > but i didn't seen this before. I noticed that i have lots of cgroups > which cannot be removed - if i do 'rmdir <cgroup_directory>', it > just hangs and never complete. Even more, it's not possible to > access the whole cgroup filesystem until i kill that rmdir > (anything, which tries it, just hangs). All unremoveable cgroups has > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 Somebody acquires the OOM wait reference to the memcg and marks it under oom but then does not call into mem_cgroup_oom_synchronize() to clean up. That's why under_oom is set and the rmdir waits for outstanding references. > And, yes, 'tasks' file is empty. It's not a kernel thread that does it because all kernel-context handle_mm_fault() are annotated properly, which means the task must be userspace and, since tasks is empty, have exited before synchronizing. Can you try with the following patch on top? 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..9a0b152 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address);
-		return 1;
-	}
 	if (!(fault & VM_FAULT_ERROR))
 		return 0;

^ permalink raw reply related	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-24 20:13 ` Johannes Weiner (?) @ 2013-06-28 10:06 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-06-28 10:06 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki >It's not a kernel thread that does it because all kernel-context >handle_mm_fault() are annotated properly, which means the task must be >userspace and, since tasks is empty, have exited before synchronizing. > >Can you try with the following patch on top? Michal and Johannes, i have some observations which i made: Original patch from Johannes was really fixing something but definitely not everything and was introducing new problems. I'm running unpatched kernel from time i send my last message and problems with freezing cgroups are occuring very often (several times per day) - they were, on the other hand, quite rare with patch from Johannes. Johannes, i didn't try your last patch yet. I would like to wait until you or Michal look at my last message which contained detailed information about freezing of cgroups on kernel running your original patch (which was suppose to fix it for good). Even more, i would like to hear your opinion about that stucked processes which was holding web server port and which forced me to reboot production server at the middle of the day :( more information was in my last message. Thank you very much for your time. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-28 10:06 ` azurIt (?) @ 2013-07-05 18:17 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-05 18:17 UTC (permalink / raw) To: azurIt Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki Hi azurIt, On Fri, Jun 28, 2013 at 12:06:13PM +0200, azurIt wrote: > >It's not a kernel thread that does it because all kernel-context > >handle_mm_fault() are annotated properly, which means the task must be > >userspace and, since tasks is empty, have exited before synchronizing. > > > >Can you try with the following patch on top? > > > Michal and Johannes, > > i have some observations which i made: Original patch from Johannes > was really fixing something but definitely not everything and was > introducing new problems. I'm running unpatched kernel from time i > send my last message and problems with freezing cgroups are occuring > very often (several times per day) - they were, on the other hand, > quite rare with patch from Johannes. That's good! > Johannes, i didn't try your last patch yet. I would like to wait > until you or Michal look at my last message which contained detailed > information about freezing of cgroups on kernel running your > original patch (which was suppose to fix it for good). Even more, i > would like to hear your opinion about that stucked processes which > was holding web server port and which forced me to reboot production > server at the middle of the day :( more information was in my last > message. Thank you very much for your time. I looked at your debug messages but could not find anything that would hint at a deadlock. All tasks are stuck in the refrigerator, so I assume you use the freezer cgroup and enabled it somehow? Sorry about your production server locking up, but from the stacks I don't see any connection to the OOM problems you were having... 
:/ ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-05 18:17 ` Johannes Weiner @ 2013-07-05 19:02 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-07-05 19:02 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki >I looked at your debug messages but could not find anything that would >hint at a deadlock. All tasks are stuck in the refrigerator, so I >assume you use the freezer cgroup and enabled it somehow? Yes, i'm really using the freezer cgroup BUT i was checking whether it was causing problems - unfortunately, several days have passed since then and i don't fully remember if i checked it for both cases (unremovable cgroups and those frozen processes holding the web server port). I'm 100% sure i checked it for the unremovable cgroups but not so sure for the other problem (i had to act quickly in that case). Are you sure (from the stacks) that the freezer cgroup was enabled there? Btw, what about those other stacks? I mean this file: http://watchdog.sk/lkml/memcg-bug-7.tar.gz It was taken while running the kernel with your patch, from a cgroup which was under unresolvable OOM (just like my very original problem). Thank you! azur ^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-05 19:02 ` azurIt (?) @ 2013-07-05 19:18 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-05 19:18 UTC (permalink / raw) To: azurIt Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >I looked at your debug messages but could not find anything that would > >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >assume you use the freezer cgroup and enabled it somehow? > > > Yes, i'm really using freezer cgroup BUT i was checking if it's not > doing problems - unfortunately, several days passed from that day > and now i don't fully remember if i was checking it for both cases > (unremoveabled cgroups and these freezed processes holding web > server port). I'm 100% sure i was checking it for unremoveable > cgroups but not so sure for the other problem (i had to act quickly > in that case). Are you sure (from stacks) that freezer cgroup was > enabled there? Yeah, all the traces without exception look like this: 1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff so the freezer was already enabled when you took the backtraces. > Btw, what about that other stacks? I mean this file: > http://watchdog.sk/lkml/memcg-bug-7.tar.gz > > It was taken while running the kernel with your patch and from > cgroup which was under unresolveable OOM (just like my very original > problem). I looked at these traces too, but none of the tasks are stuck in rmdir or the OOM path. 
Some /are/ in the page fault path, but they are happily doing reclaim and don't appear to be stuck. So I'm having a hard time matching this data to what you otherwise observed. However, based on what you reported the most likely explanation for the continued hangs is the unfinished OOM handling for which I sent the followup patch for arch/x86/mm/fault.c. ^ permalink raw reply [flat|nested] 444+ messages in thread
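The eyeballing Johannes did here — noticing that every collected trace bottoms out in `refrigerator` — can be automated by bucketing the dumps by their innermost frame. A small sketch (the regex assumes the `/proc/<pid>/stack` frame format shown above; the function name is made up):

```python
import re
from collections import Counter

# Matches one frame of a /proc/<pid>/stack dump, e.g.
# "[<ffffffff81080925>] refrigerator+0x95/0x160"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([^\s+]+)\+0x")

def innermost_frames(stack_dumps):
    """stack_dumps: iterable of /proc/<pid>/stack texts, one per task.
    Counts the innermost (first) symbol of each trace; a histogram
    dominated by 'refrigerator' means the tasks were frozen by the
    freezer cgroup rather than deadlocked in the OOM path."""
    counts = Counter()
    for text in stack_dumps:
        m = FRAME_RE.search(text)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Running this over a directory of collected dumps would make the dominant blocking point obvious at a glance instead of requiring a manual read of every trace.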
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-05 19:18 ` Johannes Weiner @ 2013-07-07 23:42 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-07-07 23:42 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >I looked at your debug messages but could not find anything that would >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >assume you use the freezer cgroup and enabled it somehow? >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> doing problems - unfortunately, several days passed from that day >> and now i don't fully remember if i was checking it for both cases >> (unremoveabled cgroups and these freezed processes holding web >> server port). I'm 100% sure i was checking it for unremoveable >> cgroups but not so sure for the other problem (i had to act quickly >> in that case). Are you sure (from stacks) that freezer cgroup was >> enabled there? > >Yeah, all the traces without exception look like this: > >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff > >so the freezer was already enabled when you took the backtraces. > >> Btw, what about that other stacks? 
I mean this file: >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> It was taken while running the kernel with your patch and from >> cgroup which was under unresolveable OOM (just like my very original >> problem). > >I looked at these traces too, but none of the tasks are stuck in rmdir >or the OOM path. Some /are/ in the page fault path, but they are >happily doing reclaim and don't appear to be stuck. So I'm having a >hard time matching this data to what you otherwise observed. > >However, based on what you reported the most likely explanation for >the continued hangs is the unfinished OOM handling for which I sent >the followup patch for arch/x86/mm/fault.c. > Johannes, today I tested both of your patches but problem with unremovable cgroups, unfortunately, persists. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-07 23:42 ` azurIt @ 2013-07-09 13:10 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-09 13:10 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 08-07-13 01:42:24, azurIt wrote: > > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> > >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: > >> >I looked at your debug messages but could not find anything that would > >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I > >> >assume you use the freezer cgroup and enabled it somehow? > >> > >> > >> Yes, i'm really using freezer cgroup BUT i was checking if it's not > >> doing problems - unfortunately, several days passed from that day > >> and now i don't fully remember if i was checking it for both cases > >> (unremoveabled cgroups and these freezed processes holding web > >> server port). I'm 100% sure i was checking it for unremoveable > >> cgroups but not so sure for the other problem (i had to act quickly > >> in that case). Are you sure (from stacks) that freezer cgroup was > >> enabled there? > > > >Yeah, all the traces without exception look like this: > > > >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 > >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 > >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 > >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 > >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 > >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff > > > >so the freezer was already enabled when you took the backtraces. > > > >> Btw, what about that other stacks? 
I mean this file: > >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz > >> > >> It was taken while running the kernel with your patch and from > >> cgroup which was under unresolveable OOM (just like my very original > >> problem). > > > >I looked at these traces too, but none of the tasks are stuck in rmdir > >or the OOM path. Some /are/ in the page fault path, but they are > >happily doing reclaim and don't appear to be stuck. So I'm having a > >hard time matching this data to what you otherwise observed. Agreed. > >However, based on what you reported the most likely explanation for > >the continued hangs is the unfinished OOM handling for which I sent > >the followup patch for arch/x86/mm/fault.c. > > Johannes, > > today I tested both of your patches but problem with unremovable > cgroups, unfortunately, persists. Is the group empty again with marked under_oom? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-09 13:10 ` Michal Hocko @ 2013-07-09 13:19 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-07-09 13:19 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki >On Mon 08-07-13 01:42:24, azurIt wrote: >> > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> >> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >> >I looked at your debug messages but could not find anything that would >> >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >> >assume you use the freezer cgroup and enabled it somehow? >> >> >> >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> >> doing problems - unfortunately, several days passed from that day >> >> and now i don't fully remember if i was checking it for both cases >> >> (unremoveabled cgroups and these freezed processes holding web >> >> server port). I'm 100% sure i was checking it for unremoveable >> >> cgroups but not so sure for the other problem (i had to act quickly >> >> in that case). Are you sure (from stacks) that freezer cgroup was >> >> enabled there? >> > >> >Yeah, all the traces without exception look like this: >> > >> >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 >> >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 >> >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 >> >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 >> >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 >> >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff >> > >> >so the freezer was already enabled when you took the backtraces. 
>> > >> >> Btw, what about that other stacks? I mean this file: >> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> >> >> It was taken while running the kernel with your patch and from >> >> cgroup which was under unresolveable OOM (just like my very original >> >> problem). >> > >> >I looked at these traces too, but none of the tasks are stuck in rmdir >> >or the OOM path. Some /are/ in the page fault path, but they are >> >happily doing reclaim and don't appear to be stuck. So I'm having a >> >hard time matching this data to what you otherwise observed. > >Agreed. > >> >However, based on what you reported the most likely explanation for >> >the continued hangs is the unfinished OOM handling for which I sent >> >the followup patch for arch/x86/mm/fault.c. >> >> Johannes, >> >> today I tested both of your patches but problem with unremovable >> cgroups, unfortunately, persists. > >Is the group empty again with marked under_oom? Now i realized that i forgot to remove UID from that cgroup before trying to remove it, so cgroup cannot be removed anyway (we are using third party cgroup called cgroup-uid from Andrea Righi, which is able to associate all user's processes with target cgroup). Look here for cgroup-uid patch: https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was permanently '1'. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-09 13:19 ` azurIt (?) @ 2013-07-09 13:54 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-09 13:54 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Tue 09-07-13 15:19:21, azurIt wrote: [...] > Now i realized that i forgot to remove UID from that cgroup before > trying to remove it, so cgroup cannot be removed anyway (we are using > third party cgroup called cgroup-uid from Andrea Righi, which is able > to associate all user's processes with target cgroup). Look here for > cgroup-uid patch: > https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > permanently '1'. This is really strange. Could you post the whole diff against stable tree you are using (except for grsecurity stuff and the above cgroup-uid patch)? Btw. the bellow patch might help us to point to the exit path which leaves wait_on_memcg without mem_cgroup_oom_synchronize: --- diff --git a/kernel/exit.c b/kernel/exit.c index e6e01b9..ad472e0 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code) profile_task_exit(tsk); + WARN_ON(current->memcg_oom.wait_on_memcg); WARN_ON(blk_needs_flush_plug(tsk)); if (unlikely(in_interrupt())) -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-09 13:54 ` Michal Hocko @ 2013-07-10 16:25 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-07-10 16:25 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea >> Now i realized that i forgot to remove UID from that cgroup before >> trying to remove it, so cgroup cannot be removed anyway (we are using >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> to associate all user's processes with target cgroup). Look here for >> cgroup-uid patch: >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> permanently '1'. > >This is really strange. Could you post the whole diff against stable >tree you are using (except for grsecurity stuff and the above cgroup-uid >patch)? Here are all patches which i applied to kernel 3.2.48 in my last test: http://watchdog.sk/lkml/patches3/ Patches marked as 7-* are from Johannes. I'm appling them in order except the grsecurity - it goes as first. azur >Btw. the bellow patch might help us to point to the exit path which >leaves wait_on_memcg without mem_cgroup_oom_synchronize: >--- >diff --git a/kernel/exit.c b/kernel/exit.c >index e6e01b9..ad472e0 100644 >--- a/kernel/exit.c >+++ b/kernel/exit.c >@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code) > > profile_task_exit(tsk); > >+ WARN_ON(current->memcg_oom.wait_on_memcg); > WARN_ON(blk_needs_flush_plug(tsk)); > > if (unlikely(in_interrupt())) >-- >Michal Hocko >SUSE Labs > ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-10 16:25 ` azurIt (?) @ 2013-07-11 7:25 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-11 7:25 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Wed 10-07-13 18:25:06, azurIt wrote: > >> Now i realized that i forgot to remove UID from that cgroup before > >> trying to remove it, so cgroup cannot be removed anyway (we are using > >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >> to associate all user's processes with target cgroup). Look here for > >> cgroup-uid patch: > >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >> > >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >> permanently '1'. > > > >This is really strange. Could you post the whole diff against stable > >tree you are using (except for grsecurity stuff and the above cgroup-uid > >patch)? > > > Here are all patches which i applied to kernel 3.2.48 in my last test: > http://watchdog.sk/lkml/patches3/ The two patches from Johannes seem correct. From a quick look even grsecurity patchset shouldn't interfere as it doesn't seem to put any code between handle_mm_fault and mm_fault_error and there also doesn't seem to be any new handle_mm_fault call sites. But I cannot tell there aren't other code paths which would lead to a memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-11 7:25 ` Michal Hocko (?) @ 2013-07-13 23:26 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-07-13 23:26 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com >On Wed 10-07-13 18:25:06, azurIt wrote: >> >> Now i realized that i forgot to remove UID from that cgroup before >> >> trying to remove it, so cgroup cannot be removed anyway (we are using >> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >> >> to associate all user's processes with target cgroup). Look here for >> >> cgroup-uid patch: >> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >> >> >> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >> >> permanently '1'. >> > >> >This is really strange. Could you post the whole diff against stable >> >tree you are using (except for grsecurity stuff and the above cgroup-uid >> >patch)? >> >> >> Here are all patches which i applied to kernel 3.2.48 in my last test: >> http://watchdog.sk/lkml/patches3/ > >The two patches from Johannes seem correct. > >From a quick look even grsecurity patchset shouldn't interfere as it >doesn't seem to put any code between handle_mm_fault and mm_fault_error >and there also doesn't seem to be any new handle_mm_fault call sites. > >But I cannot tell there aren't other code paths which would lead to a >memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. Michal, now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch. 
azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-13 23:26 ` azurIt @ 2013-07-13 23:51 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-07-13 23:51 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com >>On Wed 10-07-13 18:25:06, azurIt wrote: >>> >> Now i realized that i forgot to remove UID from that cgroup before >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able >>> >> to associate all user's processes with target cgroup). Look here for >>> >> cgroup-uid patch: >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch >>> >> >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was >>> >> permanently '1'. >>> > >>> >This is really strange. Could you post the whole diff against stable >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid >>> >patch)? >>> >>> >>> Here are all patches which i applied to kernel 3.2.48 in my last test: >>> http://watchdog.sk/lkml/patches3/ >> >>The two patches from Johannes seem correct. >> >>From a quick look even grsecurity patchset shouldn't interfere as it >>doesn't seem to put any code between handle_mm_fault and mm_fault_error >>and there also doesn't seem to be any new handle_mm_fault call sites. 
>> >>But I cannot tell there aren't other code paths which would lead to a >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > >Michal, > >now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch. > >azur Ok, i think you want this: http://watchdog.sk/lkml/kern4.log ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-13 23:51 ` azurIt (?) @ 2013-07-15 15:41 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-15 15:41 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Sun 14-07-13 01:51:12, azurIt wrote: > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > >>On Wed 10-07-13 18:25:06, azurIt wrote: > >>> >> Now i realized that i forgot to remove UID from that cgroup before > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > >>> >> to associate all user's processes with target cgroup). Look here for > >>> >> cgroup-uid patch: > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > >>> >> > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > >>> >> permanently '1'. > >>> > > >>> >This is really strange. Could you post the whole diff against stable > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > >>> >patch)? > >>> > >>> > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > >>> http://watchdog.sk/lkml/patches3/ > >> > >>The two patches from Johannes seem correct. 
> >> > >>From a quick look even grsecurity patchset shouldn't interfere as it > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > >>and there also doesn't seem to be any new handle_mm_fault call sites. > >> > >>But I cannot tell there aren't other code paths which would lead to a > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > >Michal, > > > >now i can definitely confirm that problem with unremovable cgroups > >persists. What info do you need from me? I applied also your little > >'WARN_ON' patch. > > Ok, i think you want this: > http://watchdog.sk/lkml/kern4.log Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at 
kernel/exit.c:888 do_exit+0x7d0/0x870() Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0 Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20 Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870 Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0 Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0 Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20 Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- OK, so you had an OOM which has been handled by in-kernel oom handler (it killed 12021) and 12037 was in the same group. The warning tells us that it went through mem_cgroup_oom as well (otherwise it wouldn't have memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then it exited on the userspace request (by exit syscall). I do not see any way how, this could happen though. If mem_cgroup_oom is called then we always return CHARGE_NOMEM which turns into ENOMEM returned by __mem_cgroup_try_charge (invoke_oom must have been set to true). So if nobody screwed the return value on the way up to page fault handler then there is no way to escape. I will check the code. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-15 15:41 ` Michal Hocko (?) @ 2013-07-15 16:00 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-15 16:00 UTC (permalink / raw) To: azurIt Cc: Johannes Weiner, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Mon 15-07-13 17:41:19, Michal Hocko wrote: > On Sun 14-07-13 01:51:12, azurIt wrote: > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > >>> >> to associate all user's processes with target cgroup). Look here for > > >>> >> cgroup-uid patch: > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > >>> >> > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > >>> >> permanently '1'. > > >>> > > > >>> >This is really strange. Could you post the whole diff against stable > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > >>> >patch)? > > >>> > > >>> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > >>> http://watchdog.sk/lkml/patches3/ > > >> > > >>The two patches from Johannes seem correct. 
> > >>
> > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > >>
> > >>But I cannot tell there aren't other code paths which would lead to a
> > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > >
> > >
> > >Michal,
> > >
> > >now i can definitely confirm that problem with unremovable cgroups
> > >persists. What info do you need from me? I applied also your little
> > >'WARN_ON' patch.
> >
> > Ok, i think you want this:
> > http://watchdog.sk/lkml/kern4.log
>
> Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm    rss cpu oom_adj oom_score_adj name
> Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027  64723   4       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030  64748   2       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030  64749   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619  63428   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934  60524   2       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747  63496   4       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659  62451   1       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614  60384   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342  58964   3       0             0 apache2
> Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
> Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
> Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
> Jul 14 01:11:41 server01 kernel: [  595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> Jul 14 01:11:41 server01 kernel: [  595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> Jul 14 01:11:41 server01 kernel: [  595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> Jul 14 01:11:41 server01 kernel: [  595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> Jul 14 01:11:41 server01 kernel: [  595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> Jul 14 01:11:41 server01 kernel: [  595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> Jul 14 01:11:41 server01 kernel: [  595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
>
> OK, so you had an OOM which has been handled by in-kernel oom handler
> (it killed 12021) and 12037 was in the same group. The warning tells us
> that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> it exited on the userspace request (by exit syscall).
>
> I do not see any way how, this could happen though. If mem_cgroup_oom
> is called then we always return CHARGE_NOMEM which turns into ENOMEM
> returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> true). So if nobody screwed the return value on the way up to page
> fault handler then there is no way to escape.
>
> I will check the code.
OK, I guess I found it:
__do_fault
  fault = filemap_fault
            do_async_mmap_readahead
              page_cache_async_readahead
                ondemand_readahead
                  __do_page_cache_readahead
                    read_pages
                      readpages = ext3_readpages
                        mpage_readpages		# Doesn't propagate ENOMEM
                          add_to_page_cache_lru
                            add_to_page_cache
                              add_to_page_cache_locked
                                mem_cgroup_cache_charge

So the read ahead most probably. Again! Duhhh. I will try to think
about a fix for this. One obvious place is mpage_readpages, but
__do_page_cache_readahead ignores the read_pages return value as well,
and page_cache_async_readahead, even worse, is just void and exported
as such.

So this smells like a hard to fix bugger. One possible, and really ugly,
way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
doesn't return VM_FAULT_ERROR, but that is a crude hack.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-15 16:00 ` Michal Hocko (?) @ 2013-07-16 15:35 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-16 15:35 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > >>> >> cgroup-uid patch: > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > >>> >> > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > >>> >> permanently '1'. > > > >>> > > > > >>> >This is really strange. Could you post the whole diff against stable > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > >>> >patch)? 
> > > >>>
> > > >>>
> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > >>> http://watchdog.sk/lkml/patches3/
> > > >>
> > > >>The two patches from Johannes seem correct.
> > > >>
> > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > >>
> > > >>But I cannot tell there aren't other code paths which would lead to a
> > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > >
> > > >
> > > >Michal,
> > > >
> > > >now i can definitely confirm that problem with unremovable cgroups
> > > >persists. What info do you need from me? I applied also your little
> > > >'WARN_ON' patch.
> > >
> > > Ok, i think you want this:
> > > http://watchdog.sk/lkml/kern4.log
> >
> > Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm    rss cpu oom_adj oom_score_adj name
> > Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027  64723   4       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030  64748   2       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030  64749   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619  63428   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934  60524   2       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747  63496   4       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659  62451   1       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614  60384   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342  58964   3       0             0 apache2
> > Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
> > Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
> > Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
> > Jul 14 01:11:41 server01 kernel: [  595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > Jul 14 01:11:41 server01 kernel: [  595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > Jul 14 01:11:41 server01 kernel: [  595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > Jul 14 01:11:41 server01 kernel: [  595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > Jul 14 01:11:41 server01 kernel: [  595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > Jul 14 01:11:41 server01 kernel: [  595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > Jul 14 01:11:41 server01 kernel: [  595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
> >
> > OK, so you had an OOM which has been handled by in-kernel oom handler
> > (it killed 12021) and 12037 was in the same group. The warning tells us
> > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > it exited on the userspace request (by exit syscall).
> >
> > I do not see any way how, this could happen though. If mem_cgroup_oom
> > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > true). So if nobody screwed the return value on the way up to page
> > fault handler then there is no way to escape.
> >
> > I will check the code.
>
> OK, I guess I found it:
> __do_fault
>   fault = filemap_fault
>             do_async_mmap_readahead
>               page_cache_async_readahead
>                 ondemand_readahead
>                   __do_page_cache_readahead
>                     read_pages
>                       readpages = ext3_readpages
>                         mpage_readpages		# Doesn't propagate ENOMEM
>                           add_to_page_cache_lru
>                             add_to_page_cache
>                               add_to_page_cache_locked
>                                 mem_cgroup_cache_charge
>
> So the read ahead most probably. Again! Duhhh. I will try to think
> about a fix for this. One obvious place is mpage_readpages but
> __do_page_cache_readahead ignores read_pages return value as well and
> page_cache_async_readahead, even worse, is just void and exported as
> such.
>
> So this smells like a hard to fix bugger. One possible, and really ugly
> way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> doesn't return VM_FAULT_ERROR, but that is a crude hack.

Ouch, good spot.

I don't think we need to handle an OOM from the readahead code. If
readahead does not produce the desired page, we retry synchronously in
page_cache_read() and handle the OOM properly. We should not signal an
OOM for optional pages anyway.

So either we pass a flag from the readahead code down to
add_to_page_cache and mem_cgroup_cache_charge that tells the charge
code to ignore OOM conditions and not set up an OOM context. Or we DO
call mem_cgroup_oom_synchronize() from read_cache_pages, with an
argument that makes it only clean up the context and not wait. It would
not be completely outlandish to place it there, since it's right next
to where an error from add_to_page_cache() is not propagated further
back through the fault stack.

I'm travelling right now, I'll send a patch when I get back (Thursday).
Unless you beat me to it :)

^ permalink raw reply	[flat|nested] 444+ messages in thread
> > > >> > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > >> > > > >>But I cannot tell there aren't other code paths which would lead to a > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > >Michal, > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > >persists. What info do you need from me? I applied also your little > > > >'WARN_ON' patch. > > > > > > Ok, i think you want this: > > > http://watchdog.sk/lkml/kern4.log > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > Jul 14 01:11:41 server01 kernel: [ 
595.392920] ------------[ cut here ]------------ > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870 > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0 > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0 > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20 > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > (it killed 12021) and 12037 was in the same group. The warning tells us > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > it exited on the userspace request (by exit syscall). > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > true). So if nobody screwed the return value on the way up to page > > fault handler then there is no way to escape. > > > > I will check the code. 
> OK, I guess I found it:
> __do_fault
>   fault = filemap_fault
>     do_async_mmap_readahead
>       page_cache_async_readahead
>         ondemand_readahead
>           __do_page_cache_readahead
>             read_pages
>               readpages = ext3_readpages
>                 mpage_readpages # Doesn't propagate ENOMEM
>                   add_to_page_cache_lru
>                     add_to_page_cache
>                       add_to_page_cache_locked
>                         mem_cgroup_cache_charge
>
> So the read ahead most probably. Again! Duhhh. I will try to think
> about a fix for this. One obvious place is mpage_readpages but
> __do_page_cache_readahead ignores read_pages return value as well and
> page_cache_async_readahead, even worse, is just void and exported as
> such.
>
> So this smells like a hard to fix bugger. One possible, and really ugly
> way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> doesn't return VM_FAULT_ERROR, but that is a crude hack.

Ouch, good spot.

I don't think we need to handle an OOM from the readahead code. If readahead does not produce the desired page, we retry synchronously in page_cache_read() and handle the OOM properly. We should not signal an OOM for optional pages anyway.

So either we pass a flag from the readahead code down to add_to_page_cache and mem_cgroup_cache_charge that tells the charge code to ignore OOM conditions and not set up an OOM context. Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages, with an argument that makes it only clean up the context and not wait. It would not be completely outlandish to place it there, since it's right next to where an error from add_to_page_cache() is not further propagated back through the fault stack.

I'm travelling right now, I'll send a patch when I get back (Thursday). Unless you beat me to it :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-16 15:35 ` Johannes Weiner (?) @ 2013-07-16 16:09 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-16 16:09 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > >>> >> cgroup-uid patch: > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > >>> >> > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > >>> >> permanently '1'. > > > > >>> > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > >>> >patch)? 
> > > > >>> > > > > >>> > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > >> > > > > >>The two patches from Johannes seem correct. > > > > >> > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > >> > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > >Michal, > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > >persists. What info do you need from me? I applied also your little > > > > >'WARN_ON' patch. > > > > > > > > Ok, i think you want this: > > > > http://watchdog.sk/lkml/kern4.log > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > Jul 14 01:11:39 server01 kernel: 
[ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20 > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870 > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0 > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20 > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > it exited on the userspace request (by exit syscall). > > > > > > I do not see any way how, this could happen though. 
If mem_cgroup_oom > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > true). So if nobody screwed the return value on the way up to page > > > fault handler then there is no way to escape. > > > > > > I will check the code. > > > > OK, I guess I found it: > > __do_fault > > fault = filemap_fault > > do_async_mmap_readahead > > page_cache_async_readahead > > ondemand_readahead > > __do_page_cache_readahead > > read_pages > > readpages = ext3_readpages > > mpage_readpages # Doesn't propagate ENOMEM > > add_to_page_cache_lru > > add_to_page_cache > > add_to_page_cache_locked > > mem_cgroup_cache_charge > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > about a fix for this. One obvious place is mpage_readpages but > > __do_page_cache_readahead ignores read_pages return value as well and > > page_cache_async_readahead, even worse, is just void and exported as > > such. > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > > Ouch, good spot. > > I don't think we need to handle an OOM from the readahead code. If > readahead does not produce the desired page, we retry synchroneously > in page_cache_read() and handle the OOM properly. We should not > signal an OOM for optional pages anyway. > > So either we pass a flag from the readahead code down to > add_to_page_cache and mem_cgroup_cache_charge that tells the charge > code to ignore OOM conditions and do not set up an OOM context. That was my previous attempt and it was sooo painful. > Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages, > with an argument that makes it only clean up the context and not wait. Yes, I was playing with this idea as well. I just do not like how fragile this is. 
We need some way to catch all possible places which might leak it.

> It would not be completely outlandish to place it there, since it's
> right next to where an error from add_to_page_cache() is not further
> propagated back through the fault stack.
>
> I'm travelling right now, I'll send a patch when I get back
> (Thursday). Unless you beat me to it :)

I can cook something up but there is quite a big pile on my desk currently (as always :/).

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-16 16:09 ` Michal Hocko (?) @ 2013-07-16 16:48 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-16 16:48 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > > >>> >> cgroup-uid patch: > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > > >>> >> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > > >>> >> permanently '1'. > > > > > >>> > > > > > > >>> >This is really strange. Could you post the whole diff against stable > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > > >>> >patch)? 
> > > > > >>> > > > > > >>> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > > >> > > > > > >>The two patches from Johannes seem correct. > > > > > >> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > > >> > > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > > > > >Michal, > > > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > > >persists. What info do you need from me? I applied also your little > > > > > >'WARN_ON' patch. > > > > > > > > > > Ok, i think you want this: > > > > > http://watchdog.sk/lkml/kern4.log > > > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 
12037 166342 58964 3 0 0 apache2 > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20 > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > > it exited on the userspace request (by exit syscall). 
> > > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > > true). So if nobody screwed the return value on the way up to page > > > > fault handler then there is no way to escape. > > > > > > > > I will check the code. > > > > > > OK, I guess I found it: > > > __do_fault > > > fault = filemap_fault > > > do_async_mmap_readahead > > > page_cache_async_readahead > > > ondemand_readahead > > > __do_page_cache_readahead > > > read_pages > > > readpages = ext3_readpages > > > mpage_readpages # Doesn't propagate ENOMEM > > > add_to_page_cache_lru > > > add_to_page_cache > > > add_to_page_cache_locked > > > mem_cgroup_cache_charge > > > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > > about a fix for this. One obvious place is mpage_readpages but > > > __do_page_cache_readahead ignores read_pages return value as well and > > > page_cache_async_readahead, even worse, is just void and exported as > > > such. > > > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > > > > Ouch, good spot. > > > > I don't think we need to handle an OOM from the readahead code. If > > readahead does not produce the desired page, we retry synchroneously > > in page_cache_read() and handle the OOM properly. We should not > > signal an OOM for optional pages anyway. > > > > So either we pass a flag from the readahead code down to > > add_to_page_cache and mem_cgroup_cache_charge that tells the charge > > code to ignore OOM conditions and do not set up an OOM context. > > That was my previous attempt and it was sooo painful. 
> > > Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages, > > with an argument that makes it only clean up the context and not wait. > > Yes, I was playing with this idea as well. I just do not like how > fragile this is. We need some way to catch all possible places which > might leak it. I don't think this is necessary, but we could add a sanity check in/near mem_cgroup_clear_userfault() that makes sure the OOM context is only set up when an error is returned. > > It would not be completely outlandish to place it there, since it's > > right next to where an error from add_to_page_cache() is not further > > propagated back through the fault stack. > > > > I'm travelling right now, I'll send a patch when I get back > > (Thursday). Unless you beat me to it :) > > I can cook something up but there is quite a big pile on my desk > currently (as always :/). No worries, I'll send an update. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-16 16:48 ` Johannes Weiner @ 2013-07-19 4:21 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:21 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote: > On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: > > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > > > >>> >> cgroup-uid patch: > > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > > > >>> >> > > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > > > >>> >> permanently '1'. > > > > > > >>> > > > > > > > >>> >This is really strange. 
Could you post the whole diff against stable > > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > > > >>> >patch)? > > > > > > >>> > > > > > > >>> > > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > > > >> > > > > > > >>The two patches from Johannes seem correct. > > > > > > >> > > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > > > >> > > > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > > > > > > > >Michal, > > > > > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > > > >persists. What info do you need from me? I applied also your little > > > > > > >'WARN_ON' patch. 
> > > > > >
> > > > > >
> > > > > > Ok, i think you want this:
> > > > > > http://watchdog.sk/lkml/kern4.log
> > > > >
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
> > > > >
> > > > > OK, so you had an OOM which has been handled by the in-kernel OOM handler
> > > > > (it killed 12021), and 12037 was in the same group. The warning tells us
> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger), and then
> > > > > it exited on a userspace request (via the exit syscall).
> > > > >
> > > > > I do not see any way how this could happen, though. If mem_cgroup_oom
> > > > > is called then we always return CHARGE_NOMEM, which turns into ENOMEM
> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > > > true). So if nobody screwed up the return value on the way up to the
> > > > > page fault handler then there is no way to escape.
> > > > >
> > > > > I will check the code.
> > > >
> > > > OK, I guess I found it:
> > > >
> > > > __do_fault
> > > >   fault = filemap_fault
> > > >     do_async_mmap_readahead
> > > >       page_cache_async_readahead
> > > >         ondemand_readahead
> > > >           __do_page_cache_readahead
> > > >             read_pages
> > > >               readpages = ext3_readpages
> > > >                 mpage_readpages # doesn't propagate ENOMEM
> > > >                   add_to_page_cache_lru
> > > >                     add_to_page_cache
> > > >                       add_to_page_cache_locked
> > > >                         mem_cgroup_cache_charge
> > > >
> > > > So the readahead, most probably. Again! Duhhh. I will try to think
> > > > about a fix for this. One obvious place is mpage_readpages, but
> > > > __do_page_cache_readahead ignores the read_pages return value as well,
> > > > and page_cache_async_readahead, even worse, is just void and exported
> > > > as such.
> > > >
> > > > So this smells like a hard-to-fix bugger. One possible, and really
> > > > ugly, way would be calling mem_cgroup_oom_synchronize even if
> > > > handle_mm_fault doesn't return VM_FAULT_ERROR, but that is a crude
> > > > hack.

I fixed it by disabling the OOM killer altogether for the readahead code.
We don't invoke the global OOM killer for readahead, and we shouldn't
invoke the memcg one either; these are optional allocations/charges.

I also disabled it for kernel faults triggered from within a syscall
(copy_*user, get_user_pages), which should just return -ENOMEM as usual
(unless nested inside a userspace fault). The only downside is that we
can't get around annotating userspace faults anymore, so every
architecture fault handler now passes FAULT_FLAG_USER to
handle_mm_fault(). That makes the series a little less self-contained,
but it's not unreasonable. It's also easy to detect leaks now, by
checking whether the memcg OOM context is set up while we are not
returning VM_FAULT_OOM.

Here is a combined diff based on 3.2. azurIt, any chance you could give
this a shot? I tested it on my local machines, but you have a known
reproducer of fairly unlikely scenarios...

Thanks!
Johannes diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -329,9 +335,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + 
flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -246,10 +252,10 @@ out_of_memory: __asm__ __volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; 
siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -172,10 +177,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 25b7b90..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -540,10 +545,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..90248c9 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. 
- */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; @@ -999,8 +988,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1148,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ?
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..b92e5e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + unsigned int old; + + old = p->memcg_oom.may_oom; + p->memcg_oom.may_oom = new; + + return old; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + return 0; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..a77d198 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -91,6 +91,7 @@ struct sched_param { #include <linux/latencytop.h> #include 
<linux/cred.h> #include <linux/llist.h> +#include <linux/stacktrace.h> #include <asm/processor.h> @@ -1568,6 +1569,14 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + struct stack_trace trace; + unsigned long trace_entries[16]; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index 5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..c47c77e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include <linux/page_cgroup.h> #include <linux/cpu.h> #include <linux/oom.h> +#include <linux/stacktrace.h> #include "internal.h" #include <asm/uaccess.h> @@ -249,6 +250,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1848,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1860,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; + + current->memcg_oom.trace.nr_entries = 0; + current->memcg_oom.trace.max_entries = 16; + current->memcg_oom.trace.entries = current->memcg_oom.trace_entries; + current->memcg_oom.trace.skip = 1; + save_stack_trace(&current->memcg_oom.trace); /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1887,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task. 
+ * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. + * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2256,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2317,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2405,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2413,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2426,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..fc6d741 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include <linux/swapops.h> #include <linux/elf.h> #include <linux/gfp.h> +#include <linux/stacktrace.h> #include <asm/io.h> #include <asm/pgalloc.h> @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,39 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR + if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom, + "Fixing unhandled memcg OOM context, set up from:\n")) { + print_stack_trace(&current->memcg_oom.trace, 0); + mem_cgroup_oom_synchronize(); + } +#endif + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); ^ permalink raw reply related [flat|nested] 444+ messages in thread
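The control flow the patch above establishes — the charge path only *records* the OOM situation in the task and fails with -ENOMEM, while the sleep happens later in pagefault_out_of_memory() once the fault stack (and its locks) has unwound — can be illustrated with a minimal userspace C sketch. All names here (task_ctx, try_charge, oom_synchronize, the counters) are hypothetical stand-ins for current->memcg_oom, mem_cgroup_do_charge() and mem_cgroup_oom_synchronize(); there is no real locking or scheduling, just the state machine:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task OOM context, mirroring struct memcg_oom_info. */
struct task_ctx {
	bool may_oom;      /* set while handling a userspace fault */
	bool in_memcg_oom; /* charge path hit an OOM condition */
	int wakeups_seen;  /* wakeup count sampled at OOM time */
};

static struct task_ctx current_task;
static int oom_wakeups;       /* stand-in for memcg->oom_wakeups */
static int charges_available; /* stand-in for the memcg limit */

/* Charge path: never sleeps on the OOM waitqueue while holding
 * mmap_sem etc.; it only records the OOM situation in the task
 * and fails the charge with -ENOMEM. */
static int try_charge(void)
{
	if (charges_available > 0) {
		charges_available--;
		return 0;
	}
	if (current_task.may_oom) {
		current_task.in_memcg_oom = true;
		current_task.wakeups_seen = oom_wakeups;
	}
	return -1; /* -ENOMEM */
}

/* End of the page fault, stack unwound, locks dropped: now it is
 * safe to wait for the OOM situation to be resolved.  Returns false
 * for a global OOM that someone else must handle. */
static bool oom_synchronize(void)
{
	if (!current_task.in_memcg_oom)
		return false;
	/* Only sleep if no wakeup was missed since the charge attempt;
	 * in the kernel this is the oom_wakeups comparison before
	 * schedule(). */
	if (oom_wakeups == current_task.wakeups_seen)
		; /* schedule() would go here */
	current_task.in_memcg_oom = false;
	return true;
}
```

The wakeups_seen snapshot is what closes the race between deciding to sleep and the OOM kill freeing charges: if a wakeup arrived in between, the task skips the sleep and simply retries the fault.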
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM @ 2013-07-19 4:21 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:21 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote: > On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote: > > On Tue 16-07-13 11:35:44, Johannes Weiner wrote: > > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote: > > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote: > > > > > On Sun 14-07-13 01:51:12, azurIt wrote: > > > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com > > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote: > > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before > > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using > > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able > > > > > > >>> >> to associate all user's processes with target cgroup). Look here for > > > > > > >>> >> cgroup-uid patch: > > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch > > > > > > >>> >> > > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was > > > > > > >>> >> permanently '1'. > > > > > > >>> > > > > > > > >>> >This is really strange. 
Could you post the whole diff against stable > > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid > > > > > > >>> >patch)? > > > > > > >>> > > > > > > >>> > > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test: > > > > > > >>> http://watchdog.sk/lkml/patches3/ > > > > > > >> > > > > > > >>The two patches from Johannes seem correct. > > > > > > >> > > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it > > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error > > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites. > > > > > > >> > > > > > > >>But I cannot tell there aren't other code paths which would lead to a > > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling. > > > > > > > > > > > > > > > > > > > > >Michal, > > > > > > > > > > > > > >now i can definitely confirm that problem with unremovable cgroups > > > > > > >persists. What info do you need from me? I applied also your little > > > > > > >'WARN_ON' patch. 
> > > > > > > > > > > > Ok, i think you want this: > > > > > > http://watchdog.sk/lkml/kern4.log > > > > > > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2 > > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child > > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB > > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------ > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870() > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace: > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] 
warn_slowpath_common+0x7a/0xb0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20 > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d > > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]--- > > > > > > > > > > OK, so you had an OOM which has been handled by in-kernel oom handler > > > > > (it killed 12021) and 12037 was in the same group. The warning tells us > > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have > > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then > > > > > it exited on the userspace request (by exit syscall). > > > > > > > > > > I do not see any way how, this could happen though. If mem_cgroup_oom > > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM > > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to > > > > > true). So if nobody screwed the return value on the way up to page > > > > > fault handler then there is no way to escape. > > > > > > > > > > I will check the code. 
> > > > OK, I guess I found it: > > > > __do_fault > > > > fault = filemap_fault > > > > do_async_mmap_readahead > > > > page_cache_async_readahead > > > > ondemand_readahead > > > > __do_page_cache_readahead > > > > read_pages > > > > readpages = ext3_readpages > > > > mpage_readpages # Doesn't propagate ENOMEM > > > > add_to_page_cache_lru > > > > add_to_page_cache > > > > add_to_page_cache_locked > > > > mem_cgroup_cache_charge > > > > > > > > So the read ahead most probably. Again! Duhhh. I will try to think > > > > about a fix for this. One obvious place is mpage_readpages but > > > > __do_page_cache_readahead ignores read_pages return value as well and > > > > page_cache_async_readahead, even worse, is just void and exported as > > > > such. > > > > > > > > So this smells like a hard to fix bugger. One possible, and really ugly > > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault > > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. I fixed it by disabling the OOM killer altogether for readahead code. We don't do it globally, and we should not do it in the memcg either; these are optional allocations/charges. I also disabled it for kernel faults triggered from within a syscall (copy_*user, get_user_pages), which should just return -ENOMEM as usual (unless it's nested inside a userspace fault). The only downside is that we can't get around annotating userspace faults anymore, so every architecture fault handler now passes FAULT_FLAG_USER to handle_mm_fault(). Makes the series a little less self-contained, but it's not unreasonable. It's easy to detect leaks now by checking if the memcg OOM context is set up and we are not returning VM_FAULT_OOM. Here is a combined diff based on 3.2. azurIt, any chance you could give this a shot? I tested it on my local machines, but you have a known reproducer of fairly unlikely scenarios... Thanks! 
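The "disable the OOM killer around readahead" idea described above relies on a save/restore exchange of the per-task may_oom flag, as done by the patch's mem_cgroup_xchg_may_oom() helper around do_sync_mmap_readahead()/do_async_mmap_readahead(). A minimal userspace sketch of that pattern follows; xchg_may_oom, do_readahead and task_may_oom are hypothetical stand-ins, and the real kernel helper operates on current->memcg_oom.may_oom rather than a global:

```c
#include <assert.h>

/* Hypothetical stand-in for current->memcg_oom.may_oom. */
static unsigned int task_may_oom;

/* Swap the flag and hand back the old value, so nested
 * save/restore pairs compose correctly. */
static unsigned int xchg_may_oom(unsigned int new)
{
	unsigned int old = task_may_oom;
	task_may_oom = new;
	return old;
}

/* Optional work (readahead): its failures are never propagated up
 * the fault stack, so the memcg OOM killer must not fire inside it. */
static void do_readahead(void)
{
	unsigned int saved = xchg_may_oom(0);
	/* ... charge pages; any -ENOMEM is silently dropped ... */
	assert(task_may_oom == 0);
	xchg_may_oom(saved); /* restore whatever the caller had set */
}
```

Restoring the *saved* value instead of unconditionally writing 1 is what makes the helper safe for both callers: a userspace fault (may_oom was 1) gets it back, while a kernel-syscall path (may_oom stays 0) is left untouched.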
Johannes diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -329,9 +335,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + 
flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -246,10 +252,10 @@ out_of_memory: __asm__ __volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; 
siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -172,10 +177,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 25b7b90..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -540,10 +545,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..90248c9 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. 
- */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; @@ -999,8 +988,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1148,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ?
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..b92e5e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + unsigned int old; + + old = p->memcg_oom.may_oom; + p->memcg_oom.may_oom = new; + + return old; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + return 0; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..a77d198 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -91,6 +91,7 @@ struct sched_param { #include <linux/latencytop.h> #include 
<linux/cred.h> #include <linux/llist.h> +#include <linux/stacktrace.h> #include <asm/processor.h> @@ -1568,6 +1569,14 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + struct stack_trace trace; + unsigned long trace_entries[16]; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index 5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..c47c77e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include <linux/page_cgroup.h> #include <linux/cpu.h> #include <linux/oom.h> +#include <linux/stacktrace.h> #include "internal.h" #include <asm/uaccess.h> @@ -249,6 +250,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1848,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1860,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; + + current->memcg_oom.trace.nr_entries = 0; + current->memcg_oom.trace.max_entries = 16; + current->memcg_oom.trace.entries = current->memcg_oom.trace_entries; + current->memcg_oom.trace.skip = 1; + save_stack_trace(&current->memcg_oom.trace); /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1887,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task.
+ * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. + * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2256,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2317,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2405,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2413,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2426,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..fc6d741 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include <linux/swapops.h> #include <linux/elf.h> #include <linux/gfp.h> +#include <linux/stacktrace.h> #include <asm/io.h> #include <asm/pgalloc.h> @@ -3439,22 +3440,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. 
*/ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3496,39 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. */ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR + if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom, + "Fixing unhandled memcg OOM context, set up from:\n")) { + print_stack_trace(&current->memcg_oom.trace, 0); + mem_cgroup_oom_synchronize(); + } +#endif + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom();
* [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers 2013-07-19 4:21 ` Johannes Weiner (?) @ 2013-07-19 4:22 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:22 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea [already upstream, included for 3.2 reference] A few remaining architectures directly kill the page faulting task in an out of memory situation. This is usually not a good idea since that task might not even use a significant amount of memory and so may not be the optimal victim to resolve the situation. Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there is a hook that architecture page fault handlers are supposed to call to invoke the OOM killer and let it pick the right task to kill. Convert the remaining architectures over to this hook. To have the previous behavior of simply taking out the faulting task the vm.oom_kill_allocating_task sysctl can be set to 1. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc bits] Cc: James Hogan <james.hogan@imgtec.com> Cc: David Howells <dhowells@redhat.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- arch/mn10300/mm/fault.c | 7 ++++--- arch/openrisc/mm/fault.c | 8 ++++---- arch/score/mm/fault.c | 8 ++++---- arch/tile/mm/fault.c | 8 ++++---- 4 files changed, 16 insertions(+), 15 deletions(-) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..5ac4df5 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -329,9 +329,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d78881c 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -246,10 +246,10 @@ out_of_memory: __asm__ __volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..6b18fb0 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -172,10 +172,10 @@ out_of_memory: 
down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 25b7b90..3312531 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -540,10 +540,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
* [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers @ 2013-07-19 4:22 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:22 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w [already upstream, included for 3.2 reference] A few remaining architectures directly kill the page faulting task in an out of memory situation. This is usually not a good idea since that task might not even use a significant amount of memory and so may not be the optimal victim to resolve the situation. Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there is a hook that architecture page fault handlers are supposed to call to invoke the OOM killer and let it pick the right task to kill. Convert the remaining architectures over to this hook. To have the previous behavior of simply taking out the faulting task the vm.oom_kill_allocating_task sysctl can be set to 1. 
Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Acked-by: Vineet Gupta <vgupta-HKixBCOQz3hWk0Htik3J/w@public.gmane.org> [arch/arc bits] Cc: James Hogan <james.hogan-1AXoQHu6uovQT0dZR+AlfA@public.gmane.org> Cc: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Jonas Bonn <jonas-A9uVI2HLR7kOP4wsBPIw7w@public.gmane.org> Cc: Chen Liqin <liqin.chen-+XGAvkf1AAHby3iVrkZq2A@public.gmane.org> Cc: Lennox Wu <lennox.wu-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Chris Metcalf <cmetcalf-kv+TWInifGbQT0dZR+AlfA@public.gmane.org> Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Signed-off-by: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> --- arch/mn10300/mm/fault.c | 7 ++++--- arch/openrisc/mm/fault.c | 8 ++++---- arch/score/mm/fault.c | 8 ++++---- arch/tile/mm/fault.c | 8 ++++---- 4 files changed, 16 insertions(+), 15 deletions(-) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 0945409..5ac4df5 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -329,9 +329,10 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm); - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) - do_exit(SIGKILL); + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) { + pagefault_out_of_memory(); + return; + } goto no_context; do_sigbus: diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index a5dce82..d78881c 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -246,10 +246,10 @@ out_of_memory: __asm__ __volatile__("l.nop 1"); up_read(&mm->mmap_sem); - printk("VM: killing process %s\n", tsk->comm); - if 
(user_mode(regs)) - do_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 47b600e..6b18fb0 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -172,10 +172,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - printk("VM: killing process %s\n", tsk->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - goto no_context; + if (!user_mode(regs)) + goto no_context; + pagefault_out_of_memory(); + return; do_sigbus: up_read(&mm->mmap_sem); diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 25b7b90..3312531 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -540,10 +540,10 @@ out_of_memory: down_read(&mm->mmap_sem); goto survive; } - pr_alert("VM: killing process %s\n", tsk->comm); - if (!is_kernel_mode) - do_group_exit(SIGKILL); - goto no_context; + if (is_kernel_mode) + goto no_context; + pagefault_out_of_memory(); + return 0; do_sigbus: up_read(&mm->mmap_sem); -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
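The conversion in each hunk above follows one pattern: a user-mode fault that runs out of memory now defers to pagefault_out_of_memory() instead of do_exit(SIGKILL), while kernel-mode faults still fall through to the no_context fixup path. A minimal userspace sketch of that control flow (the stub names here are illustrative stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's pagefault_out_of_memory()
 * hook: in the kernel it invokes the OOM killer so it can pick the
 * best victim, rather than killing the faulting task outright. */
static bool oom_hook_invoked;

static void fake_pagefault_out_of_memory(void)
{
	oom_hook_invoked = true;
}

/* Sketch of the converted out_of_memory error path.  Returns true
 * when the fault is handed to the OOM killer; false means the caller
 * must take the kernel no_context fixup path (a kernel-mode fault). */
static bool out_of_memory_path(bool user_mode_fault)
{
	if (!user_mode_fault)
		return false;	/* corresponds to: goto no_context; */
	fake_pagefault_out_of_memory();
	return true;
}
```

As the commit message notes, the old take-out-the-faulting-task behavior remains available by setting the vm.oom_kill_allocating_task sysctl to 1.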
* [patch 2/5] mm: pass userspace fault flag to generic fault handler 2013-07-19 4:21 ` Johannes Weiner (?) @ 2013-07-19 4:24 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:24 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea The global OOM killer is (XXX: for most architectures) only invoked for userspace faults, not for faults from kernelspace (uaccess, gup). Memcg OOM handling is currently invoked for all faults. Allow it to behave like the global case by having the architectures pass a flag to the generic fault handler code that identifies userspace faults. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- arch/alpha/mm/fault.c | 8 +++++++- arch/arm/mm/fault.c | 12 +++++++++--- arch/avr32/mm/fault.c | 8 +++++++- arch/cris/mm/fault.c | 8 +++++++- arch/frv/mm/fault.c | 8 +++++++- arch/hexagon/mm/vm_fault.c | 8 +++++++- arch/ia64/mm/fault.c | 8 +++++++- arch/m32r/mm/fault.c | 8 +++++++- arch/m68k/mm/fault.c | 8 +++++++- arch/microblaze/mm/fault.c | 8 +++++++- arch/mips/mm/fault.c | 8 +++++++- arch/mn10300/mm/fault.c | 8 +++++++- arch/openrisc/mm/fault.c | 8 +++++++- arch/parisc/mm/fault.c | 8 +++++++- arch/powerpc/mm/fault.c | 8 +++++++- arch/s390/mm/fault.c | 2 ++ arch/score/mm/fault.c | 7 ++++++- arch/sh/mm/fault_32.c | 8 +++++++- arch/sh/mm/tlbflush_64.c | 8 +++++++- arch/sparc/mm/fault_32.c | 8 +++++++- arch/sparc/mm/fault_64.c | 8 +++++++- arch/tile/mm/fault.c | 7 ++++++- arch/um/kernel/trap.c | 8 +++++++- arch/unicore32/mm/fault.c | 13 +++++++++---- arch/x86/mm/fault.c | 8 ++++++-- arch/xtensa/mm/fault.c | 8 +++++++- include/linux/mm.h | 1 + 27 files changed, 179 insertions(+), 31 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct 
mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 5ac4df5..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d78881c..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 6b18fb0..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? 
FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 3312531..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..1cebabe 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? 
FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1159,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
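Every per-architecture hunk above repeats the same two-step flag assembly before calling handle_mm_fault(): set FAULT_FLAG_USER when the fault originated in userspace, and FAULT_FLAG_WRITE for a write access. A standalone sketch of that pattern, using the bit values from the 3.2-era include/linux/mm.h (the helper function is hypothetical; the kernel open-codes this logic in each fault handler):

```c
#include <assert.h>

/* Flag values as in 3.2-era include/linux/mm.h; FAULT_FLAG_USER is
 * the bit added by this patch. */
#define FAULT_FLAG_WRITE	0x01
#define FAULT_FLAG_ALLOW_RETRY	0x08
#define FAULT_FLAG_KILLABLE	0x20
#define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */

/* Hypothetical helper mirroring the per-arch pattern: compose the
 * flags word from the fault's origin and access type. */
static unsigned int build_fault_flags(int from_user, int is_write)
{
	unsigned int flags = 0;

	if (from_user)
		flags |= FAULT_FLAG_USER;
	if (is_write)
		flags |= FAULT_FLAG_WRITE;
	return flags;
}
```

With the flag plumbed through, generic code can distinguish userspace faults from kernelspace ones (uaccess, gup) and, as the commit message says, let memcg OOM handling behave like the global case.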
* [patch 2/5] mm: pass userspace fault flag to generic fault handler @ 2013-07-19 4:24 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:24 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w The global OOM killer is (XXX: for most architectures) only invoked for userspace faults, not for faults from kernelspace (uaccess, gup). Memcg OOM handling is currently invoked for all faults. Allow it to behave like the global case by having the architectures pass a flag to the generic fault handler code that identifies userspace faults. Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> --- arch/alpha/mm/fault.c | 8 +++++++- arch/arm/mm/fault.c | 12 +++++++++--- arch/avr32/mm/fault.c | 8 +++++++- arch/cris/mm/fault.c | 8 +++++++- arch/frv/mm/fault.c | 8 +++++++- arch/hexagon/mm/vm_fault.c | 8 +++++++- arch/ia64/mm/fault.c | 8 +++++++- arch/m32r/mm/fault.c | 8 +++++++- arch/m68k/mm/fault.c | 8 +++++++- arch/microblaze/mm/fault.c | 8 +++++++- arch/mips/mm/fault.c | 8 +++++++- arch/mn10300/mm/fault.c | 8 +++++++- arch/openrisc/mm/fault.c | 8 +++++++- arch/parisc/mm/fault.c | 8 +++++++- arch/powerpc/mm/fault.c | 8 +++++++- arch/s390/mm/fault.c | 2 ++ arch/score/mm/fault.c | 7 ++++++- arch/sh/mm/fault_32.c | 8 +++++++- arch/sh/mm/tlbflush_64.c | 8 +++++++- arch/sparc/mm/fault_32.c | 8 +++++++- arch/sparc/mm/fault_64.c | 8 +++++++- arch/tile/mm/fault.c | 7 ++++++- arch/um/kernel/trap.c | 8 +++++++- arch/unicore32/mm/fault.c | 13 +++++++++---- arch/x86/mm/fault.c | 8 ++++++-- arch/xtensa/mm/fault.c | 8 +++++++- include/linux/mm.h | 1 + 27 files changed, 179 insertions(+), 31 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index fadd5f8..fa6b4e4 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,6 +89,7 @@ 
do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; + unsigned long flags = 0; siginfo_t info; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); up_read(&mm->mmap_sem); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index aa33949..31b1e69 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static int __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -253,11 +254,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (fsr & FSR_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_page_fault(mm, addr, fsr, tsk); + fault = __do_page_fault(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index f7040a1..ada6237 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs) struct mm_struct *mm; struct vm_area_struct *vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; unsigned long address; unsigned long page; int writeaccess; @@ -127,12 +128,17 @@ good_area: panic("Unhandled case %lu in do_page_fault!", ecr); } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c index 9dcac8e..35d096a 100644 --- a/arch/cris/mm/fault.c +++ b/arch/cris/mm/fault.c @@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs, goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess & 1) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c index a325d57..2dbf219 100644 --- a/arch/frv/mm/fault.c +++ b/arch/frv/mm/fault.c @@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear struct vm_area_struct *vma; struct mm_struct *mm; unsigned long _pme, lrai, lrad, fixup; + unsigned long flags = 0; siginfo_t info; pgd_t *pge; pud_t *pue; @@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear break; } + if (user_mode(__frame)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, ear0, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, ear0, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index c10b76f..e56baf3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) siginfo_t info; int si_code = SEGV_MAPERR; int fault; + unsigned long flags = 0; const struct exception_table_entry *fixup; /* @@ -96,7 +97,12 @@ good_area: break; } - fault = handle_mm_fault(mm, vma, address, (cause > 0)); + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (cause > 0) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 20b3593..ad9ef9d 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re int signal = SIGSEGV, code = SEGV_MAPERR; struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; struct siginfo si; unsigned long mask; int fault; @@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if ((vma->vm_flags & mask) != mask) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (mask & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c index 2c9aeb4..e74f6fa 100644 --- a/arch/m32r/mm/fault.c +++ b/arch/m32r/mm/fault.c @@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, struct mm_struct *mm; struct vm_area_struct * vma; unsigned long page, addr; + unsigned long flags = 0; int write; int fault; siginfo_t info; @@ -188,6 +189,11 @@ good_area: if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC)) goto bad_area; + if (error_code & ACE_USERMODE) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo @@ -195,7 +201,7 @@ good_area: */ addr = (address & PAGE_MASK); set_thread_fault_code(error_code); - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 2db6099..ab88a91 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; + unsigned long flags = 0; int write, fault; #ifdef DEBUG @@ -134,13 +135,18 @@ good_area: goto acc_err; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); #ifdef DEBUG printk("handle_mm_fault returns %d\n",fault); #endif diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index ae97d2c..b002612 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; @@ -206,12 +207,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 937cf33..e5b9fed 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -139,12 +140,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 5ac4df5..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d78881c..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 6b18fb0..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? 
FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 3312531..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..1cebabe 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? 
FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1159,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c index 5ac4df5..031be56 100644 --- a/arch/mn10300/mm/fault.c +++ b/arch/mn10300/mm/fault.c @@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code, { struct vm_area_struct *vma; struct task_struct *tsk; + unsigned long flags = 0; struct mm_struct *mm; unsigned long page; siginfo_t info; @@ -247,12 +248,17 @@ good_area: break; } + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d78881c..d586119 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + unsigned long flags = 0; siginfo_t info; int fault; @@ -153,13 +154,18 @@ good_area: if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC)) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write_acc) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write_acc); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 18162ce..a151e87 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct vm_area_struct *vma, *prev_vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned long acc_type; int fault; @@ -195,13 +196,18 @@ good_area: if ((vma->vm_flags & acc_type) != acc_type) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (acc_type & VM_WRITE) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 5efe8c9..2bf339c 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + unsigned long flags = 0; siginfo_t info; int code = SEGV_MAPERR; int is_write = 0, ret; @@ -305,12 +306,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - ret = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + ret = handle_mm_fault(mm, vma, address, flags); if (unlikely(ret & VM_FAULT_ERROR)) { if (ret & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index a9a3018..fe6109c 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access, address = trans_exc_code & __FAIL_ADDR_MASK; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); flags = FAULT_FLAG_ALLOW_RETRY; + if (regs->psw.mask & PSW_MASK_PSTATE) + flags |= FAULT_FLAG_USER; if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400) flags |= FAULT_FLAG_WRITE; down_read(&mm->mmap_sem); diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c index 6b18fb0..2ca5ae5 100644 --- a/arch/score/mm/fault.c +++ b/arch/score/mm/fault.c @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; + unsigned long flags = 0; siginfo_t info; int fault; @@ -101,12 +102,16 @@ good_area: } survive: + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. 
*/ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c index 7bebd04..a61b803 100644 --- a/arch/sh/mm/fault_32.c +++ b/arch/sh/mm/fault_32.c @@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct * vma; + unsigned long flags = 0; int si_code; int fault; siginfo_t info; @@ -195,12 +196,17 @@ good_area: goto bad_area; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c index e3430e0..0a9d645 100644 --- a/arch/sh/mm/tlbflush_64.c +++ b/arch/sh/mm/tlbflush_64.c @@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess, struct mm_struct *mm; struct vm_area_struct * vma; const struct exception_table_entry *fixup; + unsigned long flags = 0; pte_t *pte; int fault; @@ -184,12 +185,17 @@ good_area: } } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (writeaccess) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, writeaccess ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index 8023fd7..efa3d48 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, struct vm_area_struct *vma; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + unsigned long flags = 0; unsigned int fixup; unsigned long g2; int from_user = !(regs->psr & PSR_PS); @@ -285,12 +286,17 @@ good_area: goto bad_area; } + if (from_user) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 504c062..bc536ea 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; unsigned int insn = 0; int si_code, fault_code, fault; unsigned long address, mm_rss; @@ -423,7 +424,12 @@ good_area: goto bad_area; } - fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? 
FAULT_FLAG_WRITE : 0); + if (!(regs->tstate & TSTATE_PRIV)) + flags |= FAULT_FLAG_USER; + if (fault_code & FAULT_CODE_WRITE) + flags |= FAULT_FLAG_WRITE; + + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 3312531..b2a7fd5 100644 --- a/arch/tile/mm/fault.c +++ b/arch/tile/mm/fault.c @@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs, struct mm_struct *mm; struct vm_area_struct *vma; unsigned long stack_offset; + unsigned long flags = 0; int fault; int si_code; int is_kernel_mode; @@ -415,12 +416,16 @@ good_area: } survive: + if (!is_kernel_mode) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, write); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index dafc947..626a85e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + unsigned long flags = 0; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -62,10 +63,15 @@ good_area: if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC))) goto out; + if (is_user) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + do { int fault; - fault = handle_mm_fault(mm, vma, address, is_write ? 
FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index 283aa4b..3026943 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) } static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - struct task_struct *tsk) + struct task_struct *tsk, struct pt_regs *regs) { struct vm_area_struct *vma; + unsigned long flags = 0; int fault; vma = find_vma(mm, addr); @@ -191,12 +192,16 @@ good_area: goto out; } + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (!(fsr ^ 0x12)) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, make * sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, - (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags); if (unlikely(fault & VM_FAULT_ERROR)) return fault; if (fault & VM_FAULT_MAJOR) @@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs) #endif } - fault = __do_pf(mm, addr, fsr, tsk); + fault = __do_pf(mm, addr, fsr, tsk, regs); up_read(&mm->mmap_sem); /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 5db0490..1cebabe 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code) struct mm_struct *mm; int fault; int write = error_code & PF_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? 
FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; tsk = current; mm = tsk->mm; @@ -1160,6 +1159,11 @@ good_area: return; } + if (error_code & PF_USER) + flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; + /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index e367e30..7db9fbe 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs) struct mm_struct *mm = current->mm; unsigned int exccause = regs->exccause; unsigned int address = regs->excvaddr; + unsigned long flags = 0; siginfo_t info; int is_write, is_exec; @@ -101,11 +102,16 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; + if (is_write) + flags |= FAULT_FLAG_WRITE; + /* If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(mm, vma, address, flags); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/include/linux/mm.h b/include/linux/mm.h index 4baadd1..846b82b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -156,6 +156,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */ #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */ #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */ +#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ /* * This interface is used by x86 PAT code to identify a pfn mapping that is -- 1.8.3.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply related [flat|nested] 444+ messages in thread
* [patch 3/5] x86: finish fault error path with fatal signal 2013-07-19 4:21 ` Johannes Weiner @ 2013-07-19 4:25 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:25 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea The x86 fault handler bails in the middle of error handling when the task has been killed. For the next patch this is a problem, because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper OOM state unwinding. This is a rather minor optimization, just remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- arch/x86/mm/fault.c | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 1cebabe..90248c9 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -846,17 +846,6 @@ static noinline int mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. - */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(¤t->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address); - return 1; - } if (!(fault & VM_FAULT_ERROR)) return 0; -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [patch 3/5] x86: finish fault error path with fatal signal 2013-07-19 4:25 ` Johannes Weiner (?) @ 2013-07-24 20:32 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-24 20:32 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote: > The x86 fault handler bails in the middle of error handling when the > task has been killed. For the next patch this is a problem, because > it relies on pagefault_out_of_memory() being called even when the task > has been killed, to perform proper OOM state unwinding. > > This is a rather minor optimization, just remove it. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > arch/x86/mm/fault.c | 11 ----------- > 1 file changed, 11 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 1cebabe..90248c9 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -846,17 +846,6 @@ static noinline int > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. > - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(¤t->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address); > - return 1; This is broken but I only hit it now after testing for a while. The patch has the right idea: in case of an OOM kill, we should continue the fault and not abort. What I missed is that in case of a kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to exit the fault and not do up_read() etc. This introduced a locking imbalance that would get everybody hung on mmap_sem. I moved the retry handling outside of mm_fault_error() (come on...) and stole some documentation from arm. 
It's now a little bit more explicit and comparable to other architectures. I'll send an updated series, patch for reference: --- From: Johannes Weiner <hannes@cmpxchg.org> Subject: [patch] x86: finish fault error path with fatal signal The x86 fault handler bails in the middle of error handling when the task has been killed. For the next patch this is a problem, because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper OOM state unwinding. This is a rather minor optimization that cuts short the fault handling by a few instructions in rare cases. Just remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- arch/x86/mm/fault.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 6d77c38..0c18beb 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, force_sig_info_fault(SIGBUS, code, address, tsk, fault); } -static noinline int +static noinline void mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. - */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(¤t->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address, 0, 0); - return 1; - } - if (!(fault & VM_FAULT_ERROR)) - return 0; - if (fault & VM_FAULT_OOM) { /* Kernel mode? 
Handle exceptions or die: */ if (!(error_code & PF_USER)) { up_read(¤t->mm->mmap_sem); no_context(regs, error_code, address, SIGSEGV, SEGV_MAPERR); - return 1; + return; } up_read(¤t->mm->mmap_sem); @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, else BUG(); } - return 1; } static int spurious_fault_check(unsigned long error_code, pte_t *pte) @@ -1189,9 +1174,17 @@ good_area: */ fault = handle_mm_fault(mm, vma, address, flags); - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { - if (mm_fault_error(regs, error_code, address, fault)) - return; + /* + * If we need to retry but a fatal signal is pending, handle the + * signal first. We do not need to release the mmap_sem because it + * would already be released in __lock_page_or_retry in mm/filemap.c. + */ + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) + return; + + if (unlikely(fault & VM_FAULT_ERROR)) { + mm_fault_error(regs, error_code, address, fault); + return; } /* -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [patch 3/5] x86: finish fault error path with fatal signal @ 2013-07-24 20:32 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-24 20:32 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote: > The x86 fault handler bails in the middle of error handling when the > task has been killed. For the next patch this is a problem, because > it relies on pagefault_out_of_memory() being called even when the task > has been killed, to perform proper OOM state unwinding. > > This is a rather minor optimization, just remove it. > > Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > --- > arch/x86/mm/fault.c | 11 ----------- > 1 file changed, 11 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 1cebabe..90248c9 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -846,17 +846,6 @@ static noinline int > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. > - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(¤t->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address); > - return 1; This is broken but I only hit it now after testing for a while. The patch has the right idea: in case of an OOM kill, we should continue the fault and not abort. What I missed is that in case of a kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to exit the fault and not do up_read() etc. This introduced a locking imbalance that would get everybody hung on mmap_sem. 
I moved the retry handling outside of mm_fault_error() (come on...) and stole some documentation from arm. It's now a little bit more explicit and comparable to other architectures. I'll send an updated series, patch for reference: --- From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Subject: [patch] x86: finish fault error path with fatal signal The x86 fault handler bails in the middle of error handling when the task has been killed. For the next patch this is a problem, because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper OOM state unwinding. This is a rather minor optimization that cuts short the fault handling by a few instructions in rare cases. Just remove it. Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> --- arch/x86/mm/fault.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 6d77c38..0c18beb 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, force_sig_info_fault(SIGBUS, code, address, tsk, fault); } -static noinline int +static noinline void mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. - */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(¤t->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address, 0, 0); - return 1; - } - if (!(fault & VM_FAULT_ERROR)) - return 0; - if (fault & VM_FAULT_OOM) { /* Kernel mode? 
Handle exceptions or die: */ if (!(error_code & PF_USER)) { up_read(¤t->mm->mmap_sem); no_context(regs, error_code, address, SIGSEGV, SEGV_MAPERR); - return 1; + return; } up_read(¤t->mm->mmap_sem); @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, else BUG(); } - return 1; } static int spurious_fault_check(unsigned long error_code, pte_t *pte) @@ -1189,9 +1174,17 @@ good_area: */ fault = handle_mm_fault(mm, vma, address, flags); - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { - if (mm_fault_error(regs, error_code, address, fault)) - return; + /* + * If we need to retry but a fatal signal is pending, handle the + * signal first. We do not need to release the mmap_sem because it + * would already be released in __lock_page_or_retry in mm/filemap.c. + */ + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) + return; + + if (unlikely(fault & VM_FAULT_ERROR)) { + mm_fault_error(regs, error_code, address, fault); + return; } /* -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [patch 3/5] x86: finish fault error path with fatal signal @ 2013-07-24 20:32 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-24 20:32 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote: > The x86 fault handler bails in the middle of error handling when the > task has been killed. For the next patch this is a problem, because > it relies on pagefault_out_of_memory() being called even when the task > has been killed, to perform proper OOM state unwinding. > > This is a rather minor optimization, just remove it. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > arch/x86/mm/fault.c | 11 ----------- > 1 file changed, 11 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 1cebabe..90248c9 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -846,17 +846,6 @@ static noinline int > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. > - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(¤t->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address); > - return 1; This is broken but I only hit it now after testing for a while. The patch has the right idea: in case of an OOM kill, we should continue the fault and not abort. What I missed is that in case of a kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to exit the fault and not do up_read() etc. This introduced a locking imbalance that would get everybody hung on mmap_sem. I moved the retry handling outside of mm_fault_error() (come on...) and stole some documentation from arm. 
It's now a little bit more explicit and comparable to other architectures. I'll send an updated series, patch for reference: --- From: Johannes Weiner <hannes@cmpxchg.org> Subject: [patch] x86: finish fault error path with fatal signal The x86 fault handler bails in the middle of error handling when the task has been killed. For the next patch this is a problem, because it relies on pagefault_out_of_memory() being called even when the task has been killed, to perform proper OOM state unwinding. This is a rather minor optimization that cuts short the fault handling by a few instructions in rare cases. Just remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- arch/x86/mm/fault.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 6d77c38..0c18beb 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, force_sig_info_fault(SIGBUS, code, address, tsk, fault); } -static noinline int +static noinline void mm_fault_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) { - /* - * Pagefault was interrupted by SIGKILL. We have no reason to - * continue pagefault. - */ - if (fatal_signal_pending(current)) { - if (!(fault & VM_FAULT_RETRY)) - up_read(&current->mm->mmap_sem); - if (!(error_code & PF_USER)) - no_context(regs, error_code, address, 0, 0); - return 1; - } - if (!(fault & VM_FAULT_ERROR)) - return 0; - if (fault & VM_FAULT_OOM) { /* Kernel mode?
Handle exceptions or die: */ if (!(error_code & PF_USER)) { up_read(&current->mm->mmap_sem); no_context(regs, error_code, address, SIGSEGV, SEGV_MAPERR); - return 1; + return; } up_read(&current->mm->mmap_sem); @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, else BUG(); } - return 1; } static int spurious_fault_check(unsigned long error_code, pte_t *pte) @@ -1189,9 +1174,17 @@ good_area: */ fault = handle_mm_fault(mm, vma, address, flags); - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { - if (mm_fault_error(regs, error_code, address, fault)) - return; + /* + * If we need to retry but a fatal signal is pending, handle the + * signal first. We do not need to release the mmap_sem because it + * would already be released in __lock_page_or_retry in mm/filemap.c. + */ + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) + return; + + if (unlikely(fault & VM_FAULT_ERROR)) { + mm_fault_error(regs, error_code, address, fault); + return; } /* -- 1.8.3.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [patch 3/5] x86: finish fault error path with fatal signal 2013-07-24 20:32 ` Johannes Weiner @ 2013-07-25 20:29 ` KOSAKI Motohiro -1 siblings, 0 replies; 444+ messages in thread From: KOSAKI Motohiro @ 2013-07-25 20:29 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea, kosaki.motohiro (7/24/13 4:32 PM), Johannes Weiner wrote: > On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote: >> The x86 fault handler bails in the middle of error handling when the >> task has been killed. For the next patch this is a problem, because >> it relies on pagefault_out_of_memory() being called even when the task >> has been killed, to perform proper OOM state unwinding. >> >> This is a rather minor optimization, just remove it. >> >> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> >> --- >> arch/x86/mm/fault.c | 11 ----------- >> 1 file changed, 11 deletions(-) >> >> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c >> index 1cebabe..90248c9 100644 >> --- a/arch/x86/mm/fault.c >> +++ b/arch/x86/mm/fault.c >> @@ -846,17 +846,6 @@ static noinline int >> mm_fault_error(struct pt_regs *regs, unsigned long error_code, >> unsigned long address, unsigned int fault) >> { >> - /* >> - * Pagefault was interrupted by SIGKILL. We have no reason to >> - * continue pagefault. >> - */ >> - if (fatal_signal_pending(current)) { >> - if (!(fault & VM_FAULT_RETRY)) >> - up_read(&current->mm->mmap_sem); >> - if (!(error_code & PF_USER)) >> - no_context(regs, error_code, address); >> - return 1; > > This is broken but I only hit it now after testing for a while. > > The patch has the right idea: in case of an OOM kill, we should > continue the fault and not abort. What I missed is that in case of a > kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to > exit the fault and not do up_read() etc.
This introduced a locking > imbalance that would get everybody hung on mmap_sem. > > I moved the retry handling outside of mm_fault_error() (come on...) > and stole some documentation from arm. It's now a little bit more > explicit and comparable to other architectures. > > I'll send an updated series, patch for reference: > > --- > From: Johannes Weiner <hannes@cmpxchg.org> > Subject: [patch] x86: finish fault error path with fatal signal > > The x86 fault handler bails in the middle of error handling when the > task has been killed. For the next patch this is a problem, because > it relies on pagefault_out_of_memory() being called even when the task > has been killed, to perform proper OOM state unwinding. > > This is a rather minor optimization that cuts short the fault handling > by a few instructions in rare cases. Just remove it. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > arch/x86/mm/fault.c | 33 +++++++++++++-------------------- > 1 file changed, 13 insertions(+), 20 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 6d77c38..0c18beb 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, > force_sig_info_fault(SIGBUS, code, address, tsk, fault); > } > > -static noinline int > +static noinline void > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. > - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(&current->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address, 0, 0); > - return 1; > - } > - if (!(fault & VM_FAULT_ERROR)) > - return 0; > - > if (fault & VM_FAULT_OOM) { > /* Kernel mode?
Handle exceptions or die: */ > if (!(error_code & PF_USER)) { > up_read(&current->mm->mmap_sem); > no_context(regs, error_code, address, > SIGSEGV, SEGV_MAPERR); > - return 1; > + return; > } > > up_read(&current->mm->mmap_sem); > @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, > else > BUG(); > } > - return 1; > } > > static int spurious_fault_check(unsigned long error_code, pte_t *pte) > @@ -1189,9 +1174,17 @@ good_area: > */ > fault = handle_mm_fault(mm, vma, address, flags); > > - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { > - if (mm_fault_error(regs, error_code, address, fault)) > - return; > + /* > + * If we need to retry but a fatal signal is pending, handle the > + * signal first. We do not need to release the mmap_sem because it > + * would already be released in __lock_page_or_retry in mm/filemap.c. > + */ > + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) > + return; > + > + if (unlikely(fault & VM_FAULT_ERROR)) { > + mm_fault_error(regs, error_code, address, fault); > + return; > } When I made the patch whose code you removed here, Ingo suggested putting all rare-case code into an if (unlikely()) block. Yes, this is purely a micro-optimization, but it is not costly to maintain. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [patch 3/5] x86: finish fault error path with fatal signal 2013-07-25 20:29 ` KOSAKI Motohiro (?) @ 2013-07-25 21:50 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-25 21:50 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Michal Hocko, azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea On Thu, Jul 25, 2013 at 04:29:13PM -0400, KOSAKI Motohiro wrote: > (7/24/13 4:32 PM), Johannes Weiner wrote: > >@@ -1189,9 +1174,17 @@ good_area: > > */ > > fault = handle_mm_fault(mm, vma, address, flags); > > > >- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) { > >- if (mm_fault_error(regs, error_code, address, fault)) > >- return; > >+ /* > >+ * If we need to retry but a fatal signal is pending, handle the > >+ * signal first. We do not need to release the mmap_sem because it > >+ * would already be released in __lock_page_or_retry in mm/filemap.c. > >+ */ > >+ if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) > >+ return; > >+ > >+ if (unlikely(fault & VM_FAULT_ERROR)) { > >+ mm_fault_error(regs, error_code, address, fault); > >+ return; > > } > > When I made the patch you removed code, Ingo suggested we need put all rare case code > into if(unlikely()) block. Yes, this is purely micro optimization. But it is not costly > to maintain. Fair enough, thanks for the heads up! ^ permalink raw reply [flat|nested] 444+ messages in thread
* [patch 4/5] memcg: do not trap chargers with full callstack on OOM 2013-07-19 4:21 ` Johannes Weiner (?) @ 2013-07-19 4:25 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2013-07-19 4:25 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, righi.andrea The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff OOM kill victim: [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 
[<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 0. When OOMing in a system call (buffered IO and friends), do not invoke the OOM killer, do not sleep on a OOM waitqueue, just return -ENOMEM. Userspace should be able to handle this and it prevents anybody from looping or waiting with locks held. 1. When OOMing in a kernel fault, do not invoke the OOM killer, do not sleep on the OOM waitqueue, just return -ENOMEM. The kernel fault stack knows how to handle this. If a kernel fault is nested inside a user fault, however, user fault handling applies: 2. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 3. 
When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. While reworking the OOM routine, also remove a needless OOM waitqueue wakeup when invoking the killer. In addition to the wakeup implied in the kill signal delivery, only uncharges and limit increases, things that actually change the memory situation, should poke the waitqueue. Reported-by: azurIt <azurit@pobox.sk> Debugged-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/memcontrol.h | 22 +++++++ include/linux/sched.h | 6 ++ mm/filemap.c | 14 ++++- mm/ksm.c | 2 +- mm/memcontrol.c | 139 +++++++++++++++++++++++++++++---------------- mm/memory.c | 37 ++++++++---- mm/oom_kill.c | 2 + 7 files changed, 159 insertions(+), 63 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a..b92e5e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + unsigned int old; + + old = p->memcg_oom.may_oom; + p->memcg_oom.may_oom = new; + + return old; +} +bool mem_cgroup_oom_synchronize(void); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif @@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{ } +static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p, + unsigned int new) +{ + return 0; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c4f3e9..7e6c9e9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1568,6 +1568,12 @@ struct task_struct { unsigned long nr_pages; /* uncharged usage */ unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; + struct memcg_oom_info { + unsigned int may_oom:1; + unsigned int in_memcg_oom:1; + int wakeups; + struct mem_cgroup *wait_on_memcg; + } memcg_oom; #endif #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; diff --git a/mm/filemap.c b/mm/filemap.c index 5f0a3c9..d18bd47 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; + unsigned int may_oom; struct page *page; pgoff_t size; int ret = 0; @@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* - * Do we have something in the page cache already? + * Do we have something in the page cache already? Either + * way, try readahead, but disable the memcg OOM killer for it + * as readahead is optional and no errors are propagated up + * the fault stack, which does not allow proper unwinding of a + * memcg OOM state. The OOM killer is enabled while trying to + * instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page)) { @@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. 
*/ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_xchg_may_oom(current, may_oom); } else { - /* No page in the page cache at all */ + /* No page in the page cache at all. */ + may_oom = mem_cgroup_xchg_may_oom(current, 0); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_xchg_may_oom(current, may_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/ksm.c b/mm/ksm.c index 310544a..ae7e4ae 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma->vm_mm, vma, addr, - FAULT_FLAG_WRITE); + FAULT_FLAG_WRITE); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..99b0101 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -249,6 +249,7 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; + atomic_t oom_wakeups; atomic_t refcnt; @@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait, static void memcg_wakeup_oom(struct mem_cgroup *memcg) { + atomic_inc(&memcg->oom_wakeups); /* for filtering, pass "memcg" as argument. */ __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } @@ -1857,30 +1859,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) } /* - * try to call OOM killer. returns false if we should exit memory-reclaim loop. 
+ * try to call OOM killer */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask) { - struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked, need_to_kill = true; - owait.mem = memcg; - owait.wait.flags = 0; - owait.wait.func = memcg_oom_wake_function; - owait.wait.private = current; - INIT_LIST_HEAD(&owait.wait.task_list); - need_to_kill = true; - mem_cgroup_mark_under_oom(memcg); + if (!current->memcg_oom.may_oom) + return; + + current->memcg_oom.in_memcg_oom = 1; /* At first, try to OOM lock hierarchy under memcg.*/ spin_lock(&memcg_oom_lock); locked = mem_cgroup_oom_lock(memcg); - /* - * Even if signal_pending(), we can't quit charge() loop without - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL - * under OOM is always welcomed, use TASK_KILLABLE here. - */ - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); if (!locked || memcg->oom_kill_disable) need_to_kill = false; if (locked) @@ -1888,24 +1880,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) spin_unlock(&memcg_oom_lock); if (need_to_kill) { - finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask); } else { - schedule(); - finish_wait(&memcg_oom_waitq, &owait.wait); + /* + * A system call can just return -ENOMEM, but if this + * is a page fault and somebody else is handling the + * OOM already, we need to sleep on the OOM waitqueue + * for this memcg until the situation is resolved. + * Which can take some time because it might be + * handled by a userspace task. + * + * However, this is the charge context, which means + * that we may sit on a large call stack and hold + * various filesystem locks, the mmap_sem etc. and we + * don't want the OOM handler to deadlock on them + * while we sit here and wait. Store the current OOM + * context in the task_struct, then return -ENOMEM. 
+ * At the end of the page fault handler, with the + * stack unwound, pagefault_out_of_memory() will check + * back with us by calling + * mem_cgroup_oom_synchronize(), possibly putting the + * task to sleep. + */ + mem_cgroup_mark_under_oom(memcg); + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups); + css_get(&memcg->css); + current->memcg_oom.wait_on_memcg = memcg; } - spin_lock(&memcg_oom_lock); - if (locked) + + if (locked) { + spin_lock(&memcg_oom_lock); mem_cgroup_oom_unlock(memcg); - memcg_wakeup_oom(memcg); - spin_unlock(&memcg_oom_lock); + /* + * Sleeping tasks might have been killed, make sure + * they get scheduled so they can exit. + */ + if (need_to_kill) + memcg_oom_recover(memcg); + spin_unlock(&memcg_oom_lock); + } +} - mem_cgroup_unmark_under_oom(memcg); +bool mem_cgroup_oom_synchronize(void) +{ + struct oom_wait_info owait; + struct mem_cgroup *memcg; - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + /* OOM is global, do not handle */ + if (!current->memcg_oom.in_memcg_oom) return false; - /* Give chance to dying process */ - schedule_timeout_uninterruptible(1); + + /* + * We invoked the OOM killer but there is a chance that a kill + * did not free up any charges. Everybody else might already + * be sleeping, so restart the fault and keep the rampage + * going until some charges are released. 
+ */ + memcg = current->memcg_oom.wait_on_memcg; + if (!memcg) + goto out; + + if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + goto out_put; + + owait.mem = memcg; + owait.wait.flags = 0; + owait.wait.func = memcg_oom_wake_function; + owait.wait.private = current; + INIT_LIST_HEAD(&owait.wait.task_list); + + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE); + /* Only sleep if we didn't miss any wakeups since OOM */ + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups) + schedule(); + finish_wait(&memcg_oom_waitq, &owait.wait); +out_put: + mem_cgroup_unmark_under_oom(memcg); + css_put(&memcg->css); + current->memcg_oom.wait_on_memcg = NULL; +out: + current->memcg_oom.in_memcg_oom = 0; return true; } @@ -2195,11 +2249,10 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ - CHARGE_OOM_DIE, /* the current is killed because of OOM */ }; static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, - unsigned int nr_pages, bool oom_check) + unsigned int nr_pages, bool invoke_oom) { unsigned long csize = nr_pages * PAGE_SIZE; struct mem_cgroup *mem_over_limit; @@ -2257,14 +2310,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_wait_acct_move(mem_over_limit)) return CHARGE_RETRY; - /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) - return CHARGE_NOMEM; - /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) - return CHARGE_OOM_DIE; + if (invoke_oom) + mem_cgroup_oom(mem_over_limit, gfp_mask); - return CHARGE_RETRY; + return CHARGE_NOMEM; } /* @@ -2349,7 +2398,7 @@ again: } do { - bool oom_check; + bool invoke_oom = oom && !nr_oom_retries; /* If killed, bypass charge */ if (fatal_signal_pending(current)) { @@ -2357,13 +2406,7 @@ again: goto bypass; } - oom_check = false; - if (oom && 
!nr_oom_retries) { - oom_check = true; - nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES; - } - - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check); + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom); switch (ret) { case CHARGE_OK: break; @@ -2376,16 +2419,12 @@ again: css_put(&memcg->css); goto nomem; case CHARGE_NOMEM: /* OOM routine works */ - if (!oom) { + if (!oom || invoke_oom) { css_put(&memcg->css); goto nomem; } - /* If oom, we never return -ENOMEM */ nr_oom_retries--; break; - case CHARGE_OOM_DIE: /* Killed by OOM Killer */ - css_put(&memcg->css); - goto bypass; } } while (ret != CHARGE_OK); diff --git a/mm/memory.c b/mm/memory.c index 829d437..2be02b7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3439,22 +3439,14 @@ unlock: /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, unsigned int flags) +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; - __set_current_state(TASK_RUNNING); - - count_vm_event(PGFAULT); - mem_cgroup_count_vm_event(mm, PGFAULT); - - /* do counter updates before entering really critical section. */ - check_sync_rss_stat(current); - if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, flags); @@ -3503,6 +3495,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, unsigned int flags) +{ + int userfault = flags & FAULT_FLAG_USER; + int ret; + + __set_current_state(TASK_RUNNING); + + count_vm_event(PGFAULT); + mem_cgroup_count_vm_event(mm, PGFAULT); + + /* do counter updates before entering really critical section. 
*/ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
*/ + check_sync_rss_stat(current); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1); + + ret = __handle_mm_fault(mm, vma, address, flags); + + if (userfault) + WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0); + + return ret; +} + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 069b64e..aa60863 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -785,6 +785,8 @@ out: */ void pagefault_out_of_memory(void) { + if (mem_cgroup_oom_synchronize()) + return; if (try_set_system_oom()) { out_of_memory(NULL, 0, 0, NULL); clear_system_oom(); -- 1.8.3.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 444+ messages in thread
* [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind
  2013-07-19  4:21 ` Johannes Weiner
  (?)
@ 2013-07-19  4:26 ` Johannes Weiner
  -1 siblings, 0 replies; 444+ messages in thread
From: Johannes Weiner @ 2013-07-19  4:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist,
	KAMEZAWA Hiroyuki, righi.andrea

Catch the cases where a memcg OOM context is set up in the failed
charge path but the fault handler is not actually returning
VM_FAULT_ERROR, which would be required to properly finalize the OOM.

Example output: the first trace shows the stack at the end of
handle_mm_fault() where an unexpected memcg OOM context is detected.
The subsequent trace is of whoever set up that OOM context.  In this
case it was the charging of readahead pages in a file fault, which
does not propagate VM_FAULT_OOM on failure and should disable OOM:

[   27.805359] WARNING: at /home/hannes/src/linux/linux/mm/memory.c:3523 handle_mm_fault+0x1fb/0x3f0()
[   27.805360] Hardware name: PowerEdge 1950
[   27.805361] Fixing unhandled memcg OOM context, set up from:
[   27.805362] Pid: 1599, comm: file Tainted: G        W    3.2.0-00005-g6d10010 #97
[   27.805363] Call Trace:
[   27.805365]  [<ffffffff8103dcea>] warn_slowpath_common+0x6a/0xa0
[   27.805367]  [<ffffffff8103dd91>] warn_slowpath_fmt+0x41/0x50
[   27.805369]  [<ffffffff810c8ffb>] handle_mm_fault+0x1fb/0x3f0
[   27.805371]  [<ffffffff81024fa0>] do_page_fault+0x140/0x4a0
[   27.805373]  [<ffffffff810cdbfb>] ? do_mmap_pgoff+0x34b/0x360
[   27.805376]  [<ffffffff813cbc6f>] page_fault+0x1f/0x30
[   27.805377] ---[ end trace 305ec584fba81649 ]---
[   27.805378]  [<ffffffff810f2418>] __mem_cgroup_try_charge+0x5c8/0x7e0
[   27.805380]  [<ffffffff810f38fc>] mem_cgroup_cache_charge+0xac/0x110
[   27.805381]  [<ffffffff810a528e>] add_to_page_cache_locked+0x3e/0x120
[   27.805383]  [<ffffffff810a5385>] add_to_page_cache_lru+0x15/0x40
[   27.805385]  [<ffffffff8112dfa3>] mpage_readpages+0xc3/0x150
[   27.805387]  [<ffffffff8115c6d8>] ext4_readpages+0x18/0x20
[   27.805388]  [<ffffffff810afbe1>] __do_page_cache_readahead+0x1c1/0x270
[   27.805390]  [<ffffffff810b023c>] ra_submit+0x1c/0x20
[   27.805392]  [<ffffffff810a5eb4>] filemap_fault+0x3f4/0x450
[   27.805394]  [<ffffffff810c4a2d>] __do_fault+0x6d/0x510
[   27.805395]  [<ffffffff810c741a>] handle_pte_fault+0x8a/0x920
[   27.805397]  [<ffffffff810c8f9c>] handle_mm_fault+0x19c/0x3f0
[   27.805398]  [<ffffffff81024fa0>] do_page_fault+0x140/0x4a0
[   27.805400]  [<ffffffff813cbc6f>] page_fault+0x1f/0x30
[   27.805401]  [<ffffffffffffffff>] 0xffffffffffffffff

Debug patch only.

Not-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/sched.h | 3 +++
 mm/memcontrol.c       | 7 +++++++
 mm/memory.c           | 9 +++++++++
 3 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e6c9e9..a77d198 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
 #include <linux/latencytop.h>
 #include <linux/cred.h>
 #include <linux/llist.h>
+#include <linux/stacktrace.h>
 
 #include <asm/processor.h>
 
@@ -1571,6 +1572,8 @@ struct task_struct {
 	struct memcg_oom_info {
 		unsigned int may_oom:1;
 		unsigned int in_memcg_oom:1;
+		struct stack_trace trace;
+		unsigned long trace_entries[16];
 		int wakeups;
 		struct mem_cgroup *wait_on_memcg;
 	} memcg_oom;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99b0101..c47c77e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/stacktrace.h>
 #include "internal.h"
 #include <asm/uaccess.h>
 
@@ -1870,6 +1871,12 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
 
 	current->memcg_oom.in_memcg_oom = 1;
 
+	current->memcg_oom.trace.nr_entries = 0;
+	current->memcg_oom.trace.max_entries = 16;
+	current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+	current->memcg_oom.trace.skip = 1;
+	save_stack_trace(&current->memcg_oom.trace);
+
 	/* At first, try to OOM lock hierarchy under memcg.*/
 	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index 2be02b7..fc6d741 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/stacktrace.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3517,6 +3518,14 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (userfault)
 		WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom,
+		 "Fixing unhandled memcg OOM context, set up from:\n")) {
+		print_stack_trace(&current->memcg_oom.trace, 0);
+		mem_cgroup_oom_synchronize();
+	}
+#endif
+
 	return ret;
 }
 
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM
  2013-07-19  4:21 ` Johannes Weiner
  (?)
@ 2013-07-19  8:23 ` azurIt
  -1 siblings, 0 replies; 444+ messages in thread
From: azurIt @ 2013-07-19  8:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki,
	righi.andrea

> CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote:
>> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
>> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
>> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
>> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
>> > > > > On Sun 14-07-13 01:51:12, azurIt wrote:
>> > > > > > > CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>> > > > > > >> CC: "Johannes Weiner" <hannes@cmpxchg.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>, righi.andrea@gmail.com
>> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
>> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
>> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for
>> > > > > > >>> >> cgroup-uid patch:
>> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>> > > > > > >>> >>
>> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>> > > > > > >>> >> permanently '1'.
>> > > > > > >>> >
>> > > > > > >>> >This is really strange. Could you post the whole diff against stable
>> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>> > > > > > >>> >patch)?
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>> > > > > > >>> http://watchdog.sk/lkml/patches3/
>> > > > > > >>
>> > > > > > >>The two patches from Johannes seem correct.
>> > > > > > >>
>> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
>> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
>> > > > > > >>
>> > > > > > >>But I cannot tell there aren't other code paths which would lead to a
>> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
>> > > > > > >
>> > > > > > >
>> > > > > > >Michal,
>> > > > > > >
>> > > > > > >now i can definitely confirm that problem with unremovable cgroups
>> > > > > > >persists. What info do you need from me? I applied also your little
>> > > > > > >'WARN_ON' patch.
>> > > > > >
>> > > > > > Ok, i think you want this:
>> > > > > > http://watchdog.sk/lkml/kern4.log
>> > > > >
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589087] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589451] [12021]  1333 12021   172027    64723   4       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589647] [12030]  1333 12030   172030    64748   2       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.589836] [12031]  1333 12031   172030    64749   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590025] [12032]  1333 12032   170619    63428   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590213] [12033]  1333 12033   167934    60524   2       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590401] [12034]  1333 12034   170747    63496   4       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590588] [12035]  1333 12035   169659    62451   1       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590776] [12036]  1333 12036   167614    60384   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.590984] [12037]  1333 12037   166342    58964   3       0             0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
>> > > > > Jul 14 01:11:39 server01 kernel: [  593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.392920] ------------[ cut here ]------------
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393256] Hardware name: S5000VSA
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393577] Call Trace:
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393737]  [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.393903]  [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394068]  [<ffffffff81059c50>] do_exit+0x7d0/0x870
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394231]  [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394392]  [<ffffffff81059d41>] do_group_exit+0x51/0xc0
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394551]  [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394714]  [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
>> > > > > Jul 14 01:11:41 server01 kernel: [  595.394921] ---[ end trace 738570e688acf099 ]---
>> > > > >
>> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler
>> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us
>> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
>> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
>> > > > > it exited on the userspace request (by exit syscall).
>> > > > >
>> > > > > I do not see any way how this could happen though. If mem_cgroup_oom
>> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
>> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
>> > > > > true). So if nobody screwed the return value on the way up to page
>> > > > > fault handler then there is no way to escape.
>> > > > >
>> > > > > I will check the code.
>> > > >
>> > > > OK, I guess I found it:
>> > > > __do_fault
>> > > >   fault = filemap_fault
>> > > >     do_async_mmap_readahead
>> > > >       page_cache_async_readahead
>> > > >         ondemand_readahead
>> > > >           __do_page_cache_readahead
>> > > >             read_pages
>> > > >               readpages = ext3_readpages
>> > > >                 mpage_readpages		# Doesn't propagate ENOMEM
>> > > >                   add_to_page_cache_lru
>> > > >                     add_to_page_cache
>> > > >                       add_to_page_cache_locked
>> > > >                         mem_cgroup_cache_charge
>> > > >
>> > > > So the read ahead most probably. Again! Duhhh. I will try to think
>> > > > about a fix for this. One obvious place is mpage_readpages but
>> > > > __do_page_cache_readahead ignores read_pages return value as well and
>> > > > page_cache_async_readahead, even worse, is just void and exported as
>> > > > such.
>> > > >
>> > > > So this smells like a hard to fix bugger. One possible, and really ugly
>> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
>> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack.
>
>I fixed it by disabling the OOM killer altogether for readahead code.
>We don't do it globally, and we should not do it in the memcg either;
>these are optional allocations/charges.
>
>I also disabled it for kernel faults triggered from within a syscall
>(copy_*user, get_user_pages), which should just return -ENOMEM as
>usual (unless it's nested inside a userspace fault).  The only
>downside is that we can't get around annotating userspace faults
>anymore, so every architecture fault handler now passes
>FAULT_FLAG_USER to handle_mm_fault().  Makes the series a little less
>self-contained, but it's not unreasonable.
>
>It's easy to detect leaks now by checking if the memcg OOM context is
>set up and we are not returning VM_FAULT_OOM.
>
>Here is a combined diff based on 3.2.  azurIt, any chance you could
>give this a shot?  I tested it on my local machines, but you have a
>known reproducer of fairly unlikely scenarios...

I will be out of office between 25.7. and 1.8. and I don't want to run
anything that could potentially cause an outage of our services. I will
test this patch after 2.8. Should I also apply the previous patches, or
is this one enough?

Thank you very much, Johannes.

azur

^ permalink raw reply	[flat|nested] 444+ messages in thread
>> > > > >> > > > OK, I guess I found it: >> > > > __do_fault >> > > > fault = filemap_fault >> > > > do_async_mmap_readahead >> > > > page_cache_async_readahead >> > > > ondemand_readahead >> > > > __do_page_cache_readahead >> > > > read_pages >> > > > readpages = ext3_readpages >> > > > mpage_readpages # Doesn't propagate ENOMEM >> > > > add_to_page_cache_lru >> > > > add_to_page_cache >> > > > add_to_page_cache_locked >> > > > mem_cgroup_cache_charge >> > > > >> > > > So the read ahead most probably. Again! Duhhh. I will try to think >> > > > about a fix for this. One obvious place is mpage_readpages but >> > > > __do_page_cache_readahead ignores read_pages return value as well and >> > > > page_cache_async_readahead, even worse, is just void and exported as >> > > > such. >> > > > >> > > > So this smells like a hard to fix bugger. One possible, and really ugly >> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault >> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack. > >I fixed it by disabling the OOM killer altogether for readahead code. >We don't do it globally, we should not do it in the memcg, these are >optional allocations/charges. > >I also disabled it for kernel faults triggered from within a syscall >(copy_*user, get_user_pages), which should just return -ENOMEM as >usual (unless it's nested inside a userspace fault). The only >downside is that we can't get around annotating userspace faults >anymore, so every architecture fault handler now passes >FAULT_FLAG_USER to handle_mm_fault(). Makes the series a little less >self-contained, but it's not unreasonable. > >It's easy to detect leaks now by checking if the memcg OOM context is >setup and we are not returning VM_FAULT_OOM. > >Here is a combined diff based on 3.2. azurIt, any chance you could >give this a shot? I tested it on my local machines, but you have a >known reproducer of fairly unlikely scenarios... I will be out of office between 25.7. and 1.8. 
and I don't want to run anything which could potentially cause an outage of our services. I will test this patch after 2.8. Should I also apply the previous patches, or is this one enough? Thank you very much, Johannes. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-05 19:18 ` Johannes Weiner @ 2013-07-14 17:07 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-07-14 17:07 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki > CC: "Michal Hocko" <mhocko@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "cgroups mailinglist" <cgroups@vger.kernel.org>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote: >> >I looked at your debug messages but could not find anything that would >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I >> >assume you use the freezer cgroup and enabled it somehow? >> >> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not >> doing problems - unfortunately, several days passed from that day >> and now i don't fully remember if i was checking it for both cases >> (unremoveabled cgroups and these freezed processes holding web >> server port). I'm 100% sure i was checking it for unremoveable >> cgroups but not so sure for the other problem (i had to act quickly >> in that case). Are you sure (from stacks) that freezer cgroup was >> enabled there? > >Yeah, all the traces without exception look like this: > >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160 >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540 >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750 >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80 >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17 >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff > >so the freezer was already enabled when you took the backtraces. > >> Btw, what about that other stacks? 
I mean this file: >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz >> >> It was taken while running the kernel with your patch and from >> cgroup which was under unresolveable OOM (just like my very original >> problem). > >I looked at these traces too, but none of the tasks are stuck in rmdir >or the OOM path. Some /are/ in the page fault path, but they are >happily doing reclaim and don't appear to be stuck. So I'm having a >hard time matching this data to what you otherwise observed. > >However, based on what you reported the most likely explanation for >the continued hangs is the unfinished OOM handling for which I sent >the followup patch for arch/x86/mm/fault.c. Johannes, this problem happened again but was even worse; now I'm sure it wasn't my fault. This time I wasn't even able to access /proc/<pid> of the hung apache process (which was, again, holding the web server port and forced me to reboot the server). Everything that tried to access /proc/<pid> just hung. The server wasn't even able to reboot correctly; it hung and then did a hard reboot after a few minutes. azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-24 20:13 ` Johannes Weiner (?) @ 2013-07-09 13:00 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-09 13:00 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > Hi guys, > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > >> access hard drives (every process which tries it is freezed until > > >> problem is resolved or server is rebooted). > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > I'm trying to get it, stay tuned :) > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > but i didn't seen this before. I noticed that i have lots of cgroups > > which cannot be removed - if i do 'rmdir <cgroup_directory>', it > > just hangs and never complete. Even more, it's not possible to > > access the whole cgroup filesystem until i kill that rmdir > > (anything, which tries it, just hangs). All unremoveable cgroups has > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > Somebody acquires the OOM wait reference to the memcg and marks it > under oom but then does not call into mem_cgroup_oom_synchronize() to > clean up. That's why under_oom is set and the rmdir waits for > outstanding references. > > > And, yes, 'tasks' file is empty. > > It's not a kernel thread that does it because all kernel-context > handle_mm_fault() are annotated properly, which means the task must be > userspace and, since tasks is empty, have exited before synchronizing. Yes, well spotted. I have missed that while reviewing your patch. The follow up fix looks correct. > Can you try with the following patch on top? 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 5db0490..9a0b152 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -846,17 +846,6 @@ static noinline int > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, unsigned int fault) > { > - /* > - * Pagefault was interrupted by SIGKILL. We have no reason to > - * continue pagefault. > - */ > - if (fatal_signal_pending(current)) { > - if (!(fault & VM_FAULT_RETRY)) > - up_read(¤t->mm->mmap_sem); > - if (!(error_code & PF_USER)) > - no_context(regs, error_code, address); > - return 1; > - } > if (!(fault & VM_FAULT_ERROR)) > return 0; > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-09 13:00 ` Michal Hocko (?) @ 2013-07-09 13:08 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-09 13:08 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Tue 09-07-13 15:00:17, Michal Hocko wrote: > On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > > Hi guys, > > > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > >> access hard drives (every process which tries it is freezed until > > > >> problem is resolved or server is rebooted). > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > I'm trying to get it, stay tuned :) > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > but i didn't seen this before. I noticed that i have lots of cgroups > > > which cannot be removed - if i do 'rmdir <cgroup_directory>', it > > > just hangs and never complete. Even more, it's not possible to > > > access the whole cgroup filesystem until i kill that rmdir > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > clean up. That's why under_oom is set and the rmdir waits for > > outstanding references. > > > > > And, yes, 'tasks' file is empty. > > > > It's not a kernel thread that does it because all kernel-context > > handle_mm_fault() are annotated properly, which means the task must be > > userspace and, since tasks is empty, have exited before synchronizing. > > Yes, well spotted. I have missed that while reviewing your patch. > The follow up fix looks correct. 
Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well otherwise the else BUG() path would be unreachable and we wouldn't know that something fishy is going on. > > Can you try with the following patch on top? > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 5db0490..9a0b152 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -846,17 +846,6 @@ static noinline int > > mm_fault_error(struct pt_regs *regs, unsigned long error_code, > > unsigned long address, unsigned int fault) > > { > > - /* > > - * Pagefault was interrupted by SIGKILL. We have no reason to > > - * continue pagefault. > > - */ > > - if (fatal_signal_pending(current)) { > > - if (!(fault & VM_FAULT_RETRY)) > > - up_read(¤t->mm->mmap_sem); > > - if (!(error_code & PF_USER)) > > - no_context(regs, error_code, address); > > - return 1; > > - } > > if (!(fault & VM_FAULT_ERROR)) > > return 0; > > > > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-07-09 13:08 ` Michal Hocko (?) @ 2013-07-09 13:10 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-07-09 13:10 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Tue 09-07-13 15:08:08, Michal Hocko wrote: > On Tue 09-07-13 15:00:17, Michal Hocko wrote: > > On Mon 24-06-13 16:13:45, Johannes Weiner wrote: > > > Hi guys, > > > > > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote: > > > > >> But i'm sure of one thing - when problem occurs, nothing is able to > > > > >> access hard drives (every process which tries it is freezed until > > > > >> problem is resolved or server is rebooted). > > > > > > > > > >I would be really interesting to see what those tasks are blocked on. > > > > > > > > I'm trying to get it, stay tuned :) > > > > > > > > Today i noticed one bug, not 100% sure it is related to 'your' patch > > > > but i didn't seen this before. I noticed that i have lots of cgroups > > > > which cannot be removed - if i do 'rmdir <cgroup_directory>', it > > > > just hangs and never complete. Even more, it's not possible to > > > > access the whole cgroup filesystem until i kill that rmdir > > > > (anything, which tries it, just hangs). All unremoveable cgroups has > > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1 > > > > > > Somebody acquires the OOM wait reference to the memcg and marks it > > > under oom but then does not call into mem_cgroup_oom_synchronize() to > > > clean up. That's why under_oom is set and the rmdir waits for > > > outstanding references. > > > > > > > And, yes, 'tasks' file is empty. > > > > > > It's not a kernel thread that does it because all kernel-context > > > handle_mm_fault() are annotated properly, which means the task must be > > > userspace and, since tasks is empty, have exited before synchronizing. 
> > > > Yes, well spotted. I have missed that while reviewing your patch. > > The follow up fix looks correct. > > Hmm, I guess you wanted to remove !(fault & VM_FAULT_ERROR) test as well > otherwise the else BUG() path would be unreachable and we wouldn't know > that something fishy is going on. No, scratch it! We need it for VM_FAULT_RETRY. Sorry about the noise. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM 2013-06-19 13:26 ` Michal Hocko @ 2013-06-24 16:48 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-06-24 16:48 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >I would be really interesting to see what those tasks are blocked on. Ok, i got it! The problem occurred twice and behaved differently each time; I was running the kernel with that latest patch. 1.) It doesn't have an impact on the whole server, only on one cgroup. Here are the stacks: http://watchdog.sk/lkml/memcg-bug-7.tar.gz 2.) It almost took down the server because of huge I/O on the HDDs. Unfortunately, i had a bug in my script which was supposed to gather the stacks (i wasn't able to do it by hand like in (1), the server was almost inoperable). But I was lucky and somehow killed the processes from the problematic cgroup (via htop) and the server was ok again EXCEPT for one important thing - processes from that cgroup were still stuck in D state and i wasn't able to kill them for good. They were holding the web server's network ports so i had to reboot the server :( BUT, before that, i gathered the stacks: http://watchdog.sk/lkml/memcg-bug-8.tar.gz What do you think? azur ^ permalink raw reply [flat|nested] 444+ messages in thread
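Collecting the stacks of every task in a cgroup (as the script mentioned above attempted) can be done with a small helper along these lines. The function name is mine and the layout assumes cgroup v1; reading /proc/<pid>/stack requires root:

```shell
# Dump /proc/<pid>/stack for every pid listed in a cgroup's tasks file.
# $1: path to the cgroup's "tasks" file; $2: proc root (default /proc).
dump_cgroup_stacks() {
    tasks="$1"
    proc="${2:-/proc}"
    while read -r pid; do
        echo "=== pid $pid ==="
        # The task may exit between the read and the cat; ignore that.
        cat "$proc/$pid/stack" 2>/dev/null
    done < "$tasks"
}
```

Typical use: `dump_cgroup_stacks /sys/fs/cgroup/memory/uid/tasks > stacks.txt`, run from cron or a watchdog when the hang is detected.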
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set 2013-02-11 11:22 ` Michal Hocko @ 2013-02-22 12:00 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2013-02-22 12:00 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >Unfortunately I am not able to reproduce this behavior even if I try >to hammer OOM like mad so I am afraid I cannot help you much without >further debugging patches. >I do realize that experimenting in your environment is a problem but I >do not many options left. Please do not use strace and rather collect >/proc/pid/stack instead. It would be also helpful to get group/tasks >file to have a full list of tasks in the group Sending new info! I found out one interesting thing. When the problem occurs (it probably happens when OOM is triggered in the target cgroup, but i'm not sure), the target cgroup somehow becomes broken. In other words, after the problem occurs once in a target cgroup, it keeps happening in that cgroup. I made this test: 1.) I created cgroup A with limits (including a memory limit). 2.) Waited until OOM was triggered (can take hours). Processes in the target cgroup became frozen so they had to be killed. 3.) After this, processes always freeze in cgroup A; it usually takes 20-30 seconds after killing the previously frozen processes. 4.) I created cgroup B with the *same* limits as cgroup A and moved the user from A to B. The problem disappeared. 5.) Go to (2) And a second thing: i got a kernel oops, look at the end of: http://watchdog.sk/lkml/oops ^ permalink raw reply [flat|nested] 444+ messages in thread
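The workaround in step (4) above - recreating the cgroup and migrating the user - can be scripted. A sketch assuming cgroup v1 semantics, where each pid written to the destination "tasks" file migrates that task; the helper name is mine:

```shell
# Move every task from cgroup directory $1 into cgroup directory $2.
# In cgroup v1 each write of a pid into "tasks" migrates that task;
# on plain files (as in the self-test below) this is just an append.
move_all_tasks() {
    src="$1"; dst="$2"
    while read -r pid; do
        echo "$pid" >> "$dst/tasks" 2>/dev/null
    done < "$src/tasks"
}
```

After `move_all_tasks A B`, cgroup A should end up with an empty tasks file and can be removed - unless it is stuck under_oom as described in this thread.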
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-06 14:01 ` Michal Hocko @ 2013-02-07 11:01 ` Kamezawa Hiroyuki -1 siblings, 0 replies; 444+ messages in thread From: Kamezawa Hiroyuki @ 2013-02-07 11:01 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner (2013/02/06 23:01), Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: >>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>> mentioned in a follow up email. Here is the full patch: >> >> >> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >> http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > Call Trace: > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. > However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. This ENOMEM gets to the fault handler and kaboom. > Hmm. Do we need to increase the "limit" virtually at memcg OOM until the oom-killed process dies? It may be doable by increasing each cpu's stock->cache... I think the kernel can offer an extra virtual charge up to the oom-killed process's memory usage. Thanks, -Kame ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-07 11:01 ` Kamezawa Hiroyuki (?) @ 2013-02-07 12:31 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-07 12:31 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: > >On Wed 06-02-13 02:17:21, azurIt wrote: > >>>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >>>mentioned in a follow up email. Here is the full patch: > >> > >> > >>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > >>http://www.watchdog.sk/lkml/oom_mysqld6 > > > >[...] > >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > >Hardware name: S5000VSA > >gfp_mask:4304 nr_pages:1 oom:0 ret:2 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > >---[ end trace 8817670349022007 ]--- > >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > >apache2 cpuset=uid mems_allowed=0 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > > >The first trace comes from the debugging WARN and it clearly points to > >a file fault path. __do_fault pre-charges a page in case we need to > >do CoW (copy-on-write) for the returned page. This one falls back to > >memcg OOM and never returns ENOMEM as I have mentioned earlier. > >However, the fs fault handler (filemap_fault here) can fallback to > >page_cache_read if the readahead (do_sync_mmap_readahead) fails > >to get page to the page cache. And we can see this happening in > >the first trace. page_cache_read then calls add_to_page_cache_lru > >and eventually gets to add_to_page_cache_locked which calls > >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > >happen. This ENOMEM gets to the fault handler and kaboom. > > > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? It may be doable by increasing stock->cache > of each cpu....I think kernel can offer extra virtual charge up to > oom-killed process's memory usage..... If we can guarantee that the overflow charges do not exceed the memory usage of the killed process then this would work. The question is, how do we find out how much we can overflow. 
immigrate_on_move will play some role as well as the amount of the shared memory. I am afraid this would get too complex. Nevertheless the idea is nice. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked @ 2013-02-07 12:31 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-07 12:31 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: > >On Wed 06-02-13 02:17:21, azurIt wrote: > >>>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >>>mentioned in a follow up email. Here is the full patch: > >> > >> > >>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > >>http://www.watchdog.sk/lkml/oom_mysqld6 > > > >[...] > >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > >Hardware name: S5000VSA > >gfp_mask:4304 nr_pages:1 oom:0 ret:2 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 > > [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 > > [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 > > [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 > > [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 > > [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 > > [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 > > [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 > > [<ffffffff810eab18>] __do_fault+0x78/0x5a0 > > [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 > > [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 > > [<ffffffff810f2508>] ? vma_link+0x88/0xe0 > > [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 > > [<ffffffff8102709d>] do_page_fault+0x13d/0x460 > > [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > >---[ end trace 8817670349022007 ]--- > >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > >apache2 cpuset=uid mems_allowed=0 > >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 > >Call Trace: > > [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 > > [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 > > [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 > > [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 > > [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 > > [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 > > [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 > > [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 > > [<ffffffff815b61ff>] page_fault+0x1f/0x30 > > > >The first trace comes from the debugging WARN and it clearly points to > >a file fault path. __do_fault pre-charges a page in case we need to > >do CoW (copy-on-write) for the returned page. This one falls back to > >memcg OOM and never returns ENOMEM as I have mentioned earlier. > >However, the fs fault handler (filemap_fault here) can fallback to > >page_cache_read if the readahead (do_sync_mmap_readahead) fails > >to get page to the page cache. And we can see this happening in > >the first trace. page_cache_read then calls add_to_page_cache_lru > >and eventually gets to add_to_page_cache_locked which calls > >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > >happen. This ENOMEM gets to the fault handler and kaboom. > > > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? It may be doable by increasing stock->cache > of each cpu....I think kernel can offer extra virtual charge up to > oom-killed process's memory usage..... If we can guarantee that the overflow charges do not exceed the memory usage of the killed process then this would work. The question is, how do we find out how much we can overflow. 
immigrate_on_move will play some role as well as the amount of the shared memory. I am afraid this would get too complex. Nevertheless the idea is nice. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-07 12:31 ` Michal Hocko @ 2013-02-08 4:16 ` Kamezawa Hiroyuki -1 siblings, 0 replies; 444+ messages in thread From: Kamezawa Hiroyuki @ 2013-02-08 4:16 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner (2013/02/07 21:31), Michal Hocko wrote: > On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote: >> (2013/02/06 23:01), Michal Hocko wrote: >>> On Wed 06-02-13 02:17:21, azurIt wrote: >>>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>>> mentioned in a follow up email. Here is the full patch: >>>> >>>> >>>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>>> http://www.watchdog.sk/lkml/oom_mysqld6 >>> >>> [...] >>> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >>> Hardware name: S5000VSA >>> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >>> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >>> Call Trace: >>> [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 >>> [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 >>> [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 >>> [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 >>> [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 >>> [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 >>> [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 >>> [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 >>> [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 >>> [<ffffffff810eab18>] __do_fault+0x78/0x5a0 >>> [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 >>> [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 >>> [<ffffffff810f2508>] ? vma_link+0x88/0xe0 >>> [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 >>> [<ffffffff8102709d>] do_page_fault+0x13d/0x460 >>> [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 >>> [<ffffffff815b61ff>] page_fault+0x1f/0x30 >>> ---[ end trace 8817670349022007 ]--- >>> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >>> apache2 cpuset=uid mems_allowed=0 >>> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >>> Call Trace: >>> [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 >>> [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 >>> [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 >>> [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 >>> [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 >>> [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 >>> [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 >>> [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 >>> [<ffffffff815b61ff>] page_fault+0x1f/0x30 >>> >>> The first trace comes from the debugging WARN and it clearly points to >>> a file fault path. __do_fault pre-charges a page in case we need to >>> do CoW (copy-on-write) for the returned page. This one falls back to >>> memcg OOM and never returns ENOMEM as I have mentioned earlier. >>> However, the fs fault handler (filemap_fault here) can fallback to >>> page_cache_read if the readahead (do_sync_mmap_readahead) fails >>> to get page to the page cache. And we can see this happening in >>> the first trace. page_cache_read then calls add_to_page_cache_lru >>> and eventually gets to add_to_page_cache_locked which calls >>> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >>> happen. This ENOMEM gets to the fault handler and kaboom. >>> >> >> Hmm. do we need to increase the "limit" virtually at memcg oom until >> the oom-killed process dies ? It may be doable by increasing stock->cache >> of each cpu....I think kernel can offer extra virtual charge up to >> oom-killed process's memory usage..... > > If we can guarantee that the overflow charges do not exceed the memory > usage of the killed process then this would work. The question is, how > do we find out how much we can overflow. 
immigrate_on_move will play > some role as well as the amount of the shared memory. I am afraid this > would get too complex. Nevertheless the idea is nice. > Yes, that's the problem. If we don't do it in the correct way, resource usage underflow can happen. I guess we can count it per task_struct when charging page-faulted anon pages. _Or_, as another option, we could charge 1MB per thread regardless of its memory usage and use it as a security reserve at OOM-killing. Implementation will be easy but explanation may be difficult. Thanks, -Kame ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-07 11:01 ` Kamezawa Hiroyuki (?) @ 2013-02-08 1:40 ` Kamezawa Hiroyuki -1 siblings, 0 replies; 444+ messages in thread From: Kamezawa Hiroyuki @ 2013-02-08 1:40 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner (2013/02/07 20:01), Kamezawa Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: >> On Wed 06-02-13 02:17:21, azurIt wrote: >>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>> mentioned in a follow up email. Here is the full patch: >>> >>> >>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>> http://www.watchdog.sk/lkml/oom_mysqld6 >> >> [...] >> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> Hardware name: S5000VSA >> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 >> [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 >> [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 >> [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 >> [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 >> [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 >> [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 >> [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 >> [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 >> [<ffffffff810eab18>] __do_fault+0x78/0x5a0 >> [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 >> [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 >> [<ffffffff810f2508>] ? vma_link+0x88/0xe0 >> [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 >> [<ffffffff8102709d>] do_page_fault+0x13d/0x460 >> [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 >> [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> ---[ end trace 8817670349022007 ]--- >> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> apache2 cpuset=uid mems_allowed=0 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 >> [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 >> [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 >> [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 >> [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 >> [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 >> [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 >> [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 >> [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> >> The first trace comes from the debugging WARN and it clearly points to >> a file fault path. __do_fault pre-charges a page in case we need to >> do CoW (copy-on-write) for the returned page. This one falls back to >> memcg OOM and never returns ENOMEM as I have mentioned earlier. >> However, the fs fault handler (filemap_fault here) can fallback to >> page_cache_read if the readahead (do_sync_mmap_readahead) fails >> to get page to the page cache. And we can see this happening in >> the first trace. page_cache_read then calls add_to_page_cache_lru >> and eventually gets to add_to_page_cache_locked which calls >> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> happen. This ENOMEM gets to the fault handler and kaboom. >> > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? Here is my naive idea... == From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Date: Fri, 8 Feb 2013 10:43:52 +0900 Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. When an OOM happens, a task is killed and resources will be freed. 
A problem here is that an oom-killed task may wait for some other resource whose release itself requires memory. A thread waiting for free memory may hold a mutex that the oom-killed process is waiting for. To avoid this, relaxing the charged memory by granting a virtual resource can help. The system gets it back at uncharge(). This is a sample naive implementation. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 73 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 25ac5f4..4dea49a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -301,6 +301,9 @@ struct mem_cgroup { /* set when res.limit == memsw.limit */ bool memsw_is_minimum; + /* extra resource at emergency situation */ + unsigned long loan; + spinlock_t loan_lock; /* protect arrays of thresholds */ struct mutex thresholds_lock; @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, mem_cgroup_iter_break(root_memcg, victim); return total; } +/* + * When a memcg is in an OOM situation, this lack of resource may cause deadlock + * because of complicated lock dependency (i_mutex...). To avoid that, we + * need extra resource or must avoid charging. + * + * A memcg can request resource in an emergency state. We call it a loan. + * A memcg will return the loan when it uncharges resource. We disallow + * double-loan and moving tasks to other groups until the loan is fully + * returned. + * + * Note: the problem here is that we cannot know what amount of resource should + * be necessary to exit an emergency state..... 
+ */ +#define LOAN_MAX (2 * 1024 * 1024) + +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) +{ + u64 usage; + unsigned long amount; + + amount = LOAN_MAX; + + usage = res_counter_read_u64(&memcg->res, RES_USAGE); + if (amount > usage /2 ) + amount = usage / 2; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + spin_unlock(&memcg->loan_lock); + return; + } + memcg->loan = amount; + res_counter_uncharge(&memcg->res, amount); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, amount); + spin_unlock(&memcg->loan_lock); +} + +/* return amount of free resource which can be uncharged */ +static unsigned long +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) +{ + unsigned long tmp; + /* we don't care small race here */ + if (unlikely(!memcg->loan)) + return val; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + tmp = min(memcg->loan, val); + memcg->loan -= tmp; + val -= tmp; + } + spin_unlock(&memcg->loan_lock); + return val; +} + /* * Check OOM-Killer is already running under our hierarchy. 
@@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); + mem_cgroup_make_loan(memcg); } else { schedule(); finish_wait(&memcg_oom_waitq, &owait.wait); @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, if (!mem_cgroup_is_root(memcg)) { unsigned long bytes = nr_pages * PAGE_SIZE; + bytes = mem_cgroup_may_return_loan(memcg, bytes); + res_counter_uncharge(&memcg->res, bytes); if (do_swap_account) res_counter_uncharge(&memcg->memsw, bytes); @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, { struct memcg_batch_info *batch = NULL; bool uncharge_memsw = true; + unsigned long val; /* If swapout, usage of swap doesn't decrease */ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, batch->memsw_nr_pages++; return; direct_uncharge: - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); + val = nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(memcg, val); + res_counter_uncharge(&memcg->res, val); if (uncharge_memsw) - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); + res_counter_uncharge(&memcg->memsw, val); if (unlikely(batch->memcg != memcg)) memcg_oom_recover(memcg); } @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) void mem_cgroup_uncharge_end(void) { struct memcg_batch_info *batch = &current->memcg_batch; + unsigned long val; if (!batch->do_batch) return; @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) if (!batch->memcg) return; + val = batch->nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(batch->memcg, val); /* * This "batch->memcg" is valid without any css_get/put etc... * because we hide charges behind us. 
*/ if (batch->nr_pages) - res_counter_uncharge(&batch->memcg->res, - batch->nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->res, val); if (batch->memsw_nr_pages) - res_counter_uncharge(&batch->memcg->memsw, - batch->memsw_nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->memsw, val); memcg_oom_recover(batch->memcg); /* forget this pointer (for sanity check) */ batch->memcg = NULL; @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); + memcg->loan = 0; + spin_lock_init(&memcg->loan_lock); return &memcg->css; -- 1.7.10.2 ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked @ 2013-02-08 1:40 ` Kamezawa Hiroyuki 0 siblings, 0 replies; 444+ messages in thread From: Kamezawa Hiroyuki @ 2013-02-08 1:40 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner (2013/02/07 20:01), Kamezawa Hiroyuki wrote: > (2013/02/06 23:01), Michal Hocko wrote: >> On Wed 06-02-13 02:17:21, azurIt wrote: >>>> 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >>>> mentioned in a follow up email. Here is the full patch: >>> >>> >>> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: >>> http://www.watchdog.sk/lkml/oom_mysqld6 >> >> [...] >> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() >> Hardware name: S5000VSA >> gfp_mask:4304 nr_pages:1 oom:0 ret:2 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [<ffffffff8105502a>] warn_slowpath_common+0x7a/0xb0 >> [<ffffffff81055116>] warn_slowpath_fmt+0x46/0x50 >> [<ffffffff81108163>] ? mem_cgroup_margin+0x73/0xa0 >> [<ffffffff8110b6f9>] T.1149+0x2d9/0x610 >> [<ffffffff812af298>] ? blk_finish_plug+0x18/0x50 >> [<ffffffff8110c6b4>] mem_cgroup_cache_charge+0xc4/0xf0 >> [<ffffffff810ca6bf>] add_to_page_cache_locked+0x4f/0x140 >> [<ffffffff810ca7d2>] add_to_page_cache_lru+0x22/0x50 >> [<ffffffff810cad32>] filemap_fault+0x252/0x4f0 >> [<ffffffff810eab18>] __do_fault+0x78/0x5a0 >> [<ffffffff810edcb4>] handle_pte_fault+0x84/0x940 >> [<ffffffff810e2460>] ? vma_prio_tree_insert+0x30/0x50 >> [<ffffffff810f2508>] ? vma_link+0x88/0xe0 >> [<ffffffff810ee6a8>] handle_mm_fault+0x138/0x260 >> [<ffffffff8102709d>] do_page_fault+0x13d/0x460 >> [<ffffffff810f46fc>] ? 
do_mmap_pgoff+0x3dc/0x430 >> [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> ---[ end trace 8817670349022007 ]--- >> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 >> apache2 cpuset=uid mems_allowed=0 >> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1 >> Call Trace: >> [<ffffffff810ccd2e>] dump_header+0x7e/0x1e0 >> [<ffffffff810ccc2f>] ? find_lock_task_mm+0x2f/0x70 >> [<ffffffff810cd1f5>] oom_kill_process+0x85/0x2a0 >> [<ffffffff810cd8a5>] out_of_memory+0xe5/0x200 >> [<ffffffff810cda7d>] pagefault_out_of_memory+0xbd/0x110 >> [<ffffffff81026e76>] mm_fault_error+0xb6/0x1a0 >> [<ffffffff8102734e>] do_page_fault+0x3ee/0x460 >> [<ffffffff810f46fc>] ? do_mmap_pgoff+0x3dc/0x430 >> [<ffffffff815b61ff>] page_fault+0x1f/0x30 >> >> The first trace comes from the debugging WARN and it clearly points to >> a file fault path. __do_fault pre-charges a page in case we need to >> do CoW (copy-on-write) for the returned page. This one falls back to >> memcg OOM and never returns ENOMEM as I have mentioned earlier. >> However, the fs fault handler (filemap_fault here) can fallback to >> page_cache_read if the readahead (do_sync_mmap_readahead) fails >> to get page to the page cache. And we can see this happening in >> the first trace. page_cache_read then calls add_to_page_cache_lru >> and eventually gets to add_to_page_cache_locked which calls >> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should >> happen. This ENOMEM gets to the fault handler and kaboom. >> > > Hmm. do we need to increase the "limit" virtually at memcg oom until > the oom-killed process dies ? Here is my naive idea... == From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Date: Fri, 8 Feb 2013 10:43:52 +0900 Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. When an OOM happens, a task is killed and resources will be freed. 
A problem here is that an oom-killed task may wait for some other resource whose release itself requires memory. A thread waiting for free memory may hold a mutex that the oom-killed process is waiting for. To avoid this, relaxing the charged memory by granting a virtual resource can help. The system gets it back at uncharge(). This is a sample naive implementation. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 73 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 25ac5f4..4dea49a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -301,6 +301,9 @@ struct mem_cgroup { /* set when res.limit == memsw.limit */ bool memsw_is_minimum; + /* extra resource at emergency situation */ + unsigned long loan; + spinlock_t loan_lock; /* protect arrays of thresholds */ struct mutex thresholds_lock; @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, mem_cgroup_iter_break(root_memcg, victim); return total; } +/* + * When a memcg is in an OOM situation, the lack of resource may cause deadlock + * because of complicated lock dependencies (i_mutex...). To avoid that, we + * need extra resource or must avoid charging. + * + * A memcg can request resource in an emergency state. We call it a loan. + * A memcg will return the loan when it uncharges resource. We disallow + * double loans and moving tasks to other groups until the loan is fully + * returned. + * + * Note: the problem here is that we cannot know what amount of resource + * is necessary to exit an emergency state. 
+ */ +#define LOAN_MAX (2 * 1024 * 1024) + +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) +{ + u64 usage; + unsigned long amount; + + amount = LOAN_MAX; + + usage = res_counter_read_u64(&memcg->res, RES_USAGE); + if (amount > usage / 2) + amount = usage / 2; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + spin_unlock(&memcg->loan_lock); + return; + } + memcg->loan = amount; + res_counter_uncharge(&memcg->res, amount); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, amount); + spin_unlock(&memcg->loan_lock); +} + +/* return amount of free resource which can be uncharged */ +static unsigned long +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) +{ + unsigned long tmp; + /* we don't care small race here */ + if (unlikely(!memcg->loan)) + return val; + spin_lock(&memcg->loan_lock); + if (memcg->loan) { + tmp = min(memcg->loan, val); + memcg->loan -= tmp; + val -= tmp; + } + spin_unlock(&memcg->loan_lock); + return val; +} + /* * Check OOM-Killer is already running under our hierarchy. 
@@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); mem_cgroup_out_of_memory(memcg, mask, order); + mem_cgroup_make_loan(memcg); } else { schedule(); finish_wait(&memcg_oom_waitq, &owait.wait); @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, if (!mem_cgroup_is_root(memcg)) { unsigned long bytes = nr_pages * PAGE_SIZE; + bytes = mem_cgroup_may_return_loan(memcg, bytes); + res_counter_uncharge(&memcg->res, bytes); if (do_swap_account) res_counter_uncharge(&memcg->memsw, bytes); @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, { struct memcg_batch_info *batch = NULL; bool uncharge_memsw = true; + unsigned long val; /* If swapout, usage of swap doesn't decrease */ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, batch->memsw_nr_pages++; return; direct_uncharge: - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); + val = nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(memcg, val); + res_counter_uncharge(&memcg->res, val); if (uncharge_memsw) - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); + res_counter_uncharge(&memcg->memsw, val); if (unlikely(batch->memcg != memcg)) memcg_oom_recover(memcg); } @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) void mem_cgroup_uncharge_end(void) { struct memcg_batch_info *batch = &current->memcg_batch; + unsigned long val; if (!batch->do_batch) return; @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) if (!batch->memcg) return; + val = batch->nr_pages * PAGE_SIZE; + val = mem_cgroup_may_return_loan(batch->memcg, val); /* * This "batch->memcg" is valid without any css_get/put etc... * bacause we hide charges behind us. 
*/ if (batch->nr_pages) - res_counter_uncharge(&batch->memcg->res, - batch->nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->res, val); if (batch->memsw_nr_pages) - res_counter_uncharge(&batch->memcg->memsw, - batch->memsw_nr_pages * PAGE_SIZE); + res_counter_uncharge(&batch->memcg->memsw, val); memcg_oom_recover(batch->memcg); /* forget this pointer (for sanity check) */ batch->memcg = NULL; @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); + memcg->loan = 0; + spin_lock_init(&memcg->loan_lock); return &memcg->css; -- 1.7.10.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 444+ messages in thread
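[Editor's note] The accounting in mem_cgroup_make_loan()/mem_cgroup_may_return_loan() above can be exercised outside the kernel. Below is a minimal user-space C model of the same logic (locking dropped, the res_counter replaced by a plain usage field; the struct and function names are illustrative, not kernel API):

```c
#include <assert.h>

#define LOAN_MAX (2UL * 1024 * 1024)

/* Toy model of the memcg counters involved in the loan. */
struct memcg_model {
    unsigned long usage; /* modelled res_counter usage, in bytes */
    unsigned long loan;  /* outstanding loan, in bytes           */
};

/* Mirrors mem_cgroup_make_loan(): grant up to LOAN_MAX, but never
 * more than half of current usage, and refuse a second loan while
 * one is outstanding. */
static void make_loan(struct memcg_model *m)
{
    unsigned long amount = LOAN_MAX;

    if (amount > m->usage / 2)
        amount = m->usage / 2;
    if (m->loan)            /* double loans are disallowed */
        return;
    m->loan = amount;
    m->usage -= amount;     /* models res_counter_uncharge() */
}

/* Mirrors mem_cgroup_may_return_loan(): on an uncharge of `val`
 * bytes, repay the loan first and return what is left to actually
 * uncharge from the counter. */
static unsigned long may_return_loan(struct memcg_model *m, unsigned long val)
{
    if (m->loan) {
        unsigned long tmp = m->loan < val ? m->loan : val;

        m->loan -= tmp;
        val -= tmp;
    }
    return val;
}
```

Running a charge/uncharge sequence against this model shows the invariant the patch relies on: a loan is granted at most once, and every borrowed byte is deducted from later uncharges before the real counter is decremented.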
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-08 1:40 ` Kamezawa Hiroyuki (?) @ 2013-02-08 16:01 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-08 16:01 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote: > (2013/02/07 20:01), Kamezawa Hiroyuki wrote: [...] > >Hmm. do we need to increase the "limit" virtually at memcg oom until > >the oom-killed process dies ? > > Here is my naive idea... and the next step would be http://en.wikipedia.org/wiki/Credit_default_swap :P But seriously now. The idea is not bad at all. This implementation would need some tweaks to work though (e.g. you would need to wake oom sleepers when you get a loan - because those are the ones which can block the resource). We should also give the borrowed charges only to those who would oom, to prevent stealing. I think that it should be mem_cgroup_out_of_memory that establishes the loan, and it can have a look at how much memory the killed task frees - e.g. some portion of get_mm_rss(), or, more precisely but much more expensively, traverse its private vmas and check whether they charged memory from the target memcg hierarchy (this is a slow path anyway). But who knows, maybe a fixed 2MB would work out as well. Thanks! > == > From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Date: Fri, 8 Feb 2013 10:43:52 +0900 > Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. > > When an OOM happens, a task is killed and resources will be freed. > > A problem here is that a task, which is oom-killed, may wait for > some other resource in which memory resource is required. Some thread > waits for free memory may holds some mutex and oom-killed process > wait for the mutex. 
> > To avoid this, relaxing charged memory by giving virtual resource > can be a help. The system can get back it at uncharge(). > This is a sample native implementation. > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > --- > mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 73 insertions(+), 6 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 25ac5f4..4dea49a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -301,6 +301,9 @@ struct mem_cgroup { > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > + /* extra resource at emergency situation */ > + unsigned long loan; > + spinlock_t loan_lock; > /* protect arrays of thresholds */ > struct mutex thresholds_lock; > @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > mem_cgroup_iter_break(root_memcg, victim); > return total; > } > +/* > + * When a memcg is in OOM situation, this lack of resource may cause deadlock > + * because of complicated lock dependency(i_mutex...). To avoid that, we > + * need extra resource or avoid charging. > + * > + * A memcg can request resource in an emergency state. We call it as loan. > + * A memcg will return a loan when it does uncharge resource. We disallow > + * double-loan and moving task to other groups until the loan is fully > + * returned. > + * > + * Note: the problem here is that we cannot know what amount resouce should > + * be necessary to exiting an emergency state..... 
> + */ > +#define LOAN_MAX (2 * 1024 * 1024) > + > +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) > +{ > + u64 usage; > + unsigned long amount; > + > + amount = LOAN_MAX; > + > + usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + if (amount > usage /2 ) > + amount = usage / 2; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + spin_unlock(&memcg->loan_lock); > + return; > + } > + memcg->loan = amount; > + res_counter_uncharge(&memcg->res, amount); > + if (do_swap_account) > + res_counter_uncharge(&memcg->memsw, amount); > + spin_unlock(&memcg->loan_lock); > +} > + > +/* return amount of free resource which can be uncharged */ > +static unsigned long > +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) > +{ > + unsigned long tmp; > + /* we don't care small race here */ > + if (unlikely(!memcg->loan)) > + return val; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + tmp = min(memcg->loan, val); > + memcg->loan -= tmp; > + val -= tmp; > + } > + spin_unlock(&memcg->loan_lock); > + return val; > +} > + > /* > * Check OOM-Killer is already running under our hierarchy. 
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(memcg, mask, order); > + mem_cgroup_make_loan(memcg); > } else { > schedule(); > finish_wait(&memcg_oom_waitq, &owait.wait); > @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, > if (!mem_cgroup_is_root(memcg)) { > unsigned long bytes = nr_pages * PAGE_SIZE; > + bytes = mem_cgroup_may_return_loan(memcg, bytes); > + > res_counter_uncharge(&memcg->res, bytes); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, bytes); > @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > { > struct memcg_batch_info *batch = NULL; > bool uncharge_memsw = true; > + unsigned long val; > /* If swapout, usage of swap doesn't decrease */ > if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) > @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > batch->memsw_nr_pages++; > return; > direct_uncharge: > - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); > + val = nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(memcg, val); > + res_counter_uncharge(&memcg->res, val); > if (uncharge_memsw) > - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); > + res_counter_uncharge(&memcg->memsw, val); > if (unlikely(batch->memcg != memcg)) > memcg_oom_recover(memcg); > } > @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) > void mem_cgroup_uncharge_end(void) > { > struct memcg_batch_info *batch = &current->memcg_batch; > + unsigned long val; > if (!batch->do_batch) > return; > @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) > if (!batch->memcg) > return; > + val = batch->nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(batch->memcg, val); > /* > * This "batch->memcg" is valid without any css_get/put etc... > * bacause we hide charges behind us. 
> */ > if (batch->nr_pages) > - res_counter_uncharge(&batch->memcg->res, > - batch->nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->res, val); > if (batch->memsw_nr_pages) > - res_counter_uncharge(&batch->memcg->memsw, > - batch->memsw_nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->memsw, val); > memcg_oom_recover(batch->memcg); > /* forget this pointer (for sanity check) */ > batch->memcg = NULL; > @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) > memcg->move_charge_at_immigrate = 0; > mutex_init(&memcg->thresholds_lock); > spin_lock_init(&memcg->move_lock); > + memcg->loan = 0; > + spin_lock_init(&memcg->loan_lock); > return &memcg->css; > -- > 1.7.10.2 > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
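[Editor's note] Michal's sizing suggestion in the reply above - derive the loan from the memory the oom-killed task is expected to free, rather than a fixed 2MB - could be sketched as follows. The 1/4 fraction and the loan_from_victim() helper are assumptions for illustration only; in the kernel, the RSS figure would come from get_mm_rss(victim->mm):

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL
#define LOAN_MAX (2UL * 1024 * 1024)

/* Hypothetical sizing helper: take a quarter of the victim's RSS as
 * the loan, clamped to the range [one page, LOAN_MAX].
 * victim_rss_pages stands in for get_mm_rss(victim->mm). */
static unsigned long loan_from_victim(unsigned long victim_rss_pages)
{
    unsigned long bytes = victim_rss_pages * PAGE_SIZE_BYTES / 4;

    if (bytes < PAGE_SIZE_BYTES)
        bytes = PAGE_SIZE_BYTES;
    if (bytes > LOAN_MAX)
        bytes = LOAN_MAX;
    return bytes;
}
```

The clamp keeps the loan useful for tiny victims while bounding the temporary overrun for large ones, which is the trade-off the thread leaves open.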
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked @ 2013-02-08 16:01 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-08 16:01 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote: > (2013/02/07 20:01), Kamezawa Hiroyuki wrote: [...] > >Hmm. do we need to increase the "limit" virtually at memcg oom until > >the oom-killed process dies ? > > Here is my naive idea... and the next step would be http://en.wikipedia.org/wiki/Credit_default_swap :P But seriously now. The idea is not bad at all. This implementation would need some tweaks to work though (e.g. you would need to wake oom sleepers when you get a loan - because those are ones which can block the resource). We should also give the borrowed charges only to those who would oom to prevent from stealing. I think that it should be mem_cgroup_out_of_memory who establishes the loan and it can have a look at how much memory the killed task frees - e.g. some portion of get_mm_rss() or a more precise but much more expensive traversing via private vmas and check whether they charged memory from the target memcg hierarchy (this is a slow path anyway). But who knows maybe a fixed 2MB would work out as well. Thanks! > == > From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Date: Fri, 8 Feb 2013 10:43:52 +0900 > Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation. > > When an OOM happens, a task is killed and resources will be freed. > > A problem here is that a task, which is oom-killed, may wait for > some other resource in which memory resource is required. Some thread > waits for free memory may holds some mutex and oom-killed process > wait for the mutex. 
> > To avoid this, relaxing charged memory by giving virtual resource > can be a help. The system can get back it at uncharge(). > This is a sample native implementation. > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > --- > mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 73 insertions(+), 6 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 25ac5f4..4dea49a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -301,6 +301,9 @@ struct mem_cgroup { > /* set when res.limit == memsw.limit */ > bool memsw_is_minimum; > + /* extra resource at emergency situation */ > + unsigned long loan; > + spinlock_t loan_lock; > /* protect arrays of thresholds */ > struct mutex thresholds_lock; > @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, > mem_cgroup_iter_break(root_memcg, victim); > return total; > } > +/* > + * When a memcg is in OOM situation, this lack of resource may cause deadlock > + * because of complicated lock dependency(i_mutex...). To avoid that, we > + * need extra resource or avoid charging. > + * > + * A memcg can request resource in an emergency state. We call it as loan. > + * A memcg will return a loan when it does uncharge resource. We disallow > + * double-loan and moving task to other groups until the loan is fully > + * returned. > + * > + * Note: the problem here is that we cannot know what amount resouce should > + * be necessary to exiting an emergency state..... 
> + */ > +#define LOAN_MAX (2 * 1024 * 1024) > + > +static void mem_cgroup_make_loan(struct mem_cgroup *memcg) > +{ > + u64 usage; > + unsigned long amount; > + > + amount = LOAN_MAX; > + > + usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + if (amount > usage /2 ) > + amount = usage / 2; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + spin_unlock(&memcg->loan_lock); > + return; > + } > + memcg->loan = amount; > + res_counter_uncharge(&memcg->res, amount); > + if (do_swap_account) > + res_counter_uncharge(&memcg->memsw, amount); > + spin_unlock(&memcg->loan_lock); > +} > + > +/* return amount of free resource which can be uncharged */ > +static unsigned long > +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val) > +{ > + unsigned long tmp; > + /* we don't care small race here */ > + if (unlikely(!memcg->loan)) > + return val; > + spin_lock(&memcg->loan_lock); > + if (memcg->loan) { > + tmp = min(memcg->loan, val); > + memcg->loan -= tmp; > + val -= tmp; > + } > + spin_unlock(&memcg->loan_lock); > + return val; > +} > + > /* > * Check OOM-Killer is already running under our hierarchy. 
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > if (need_to_kill) { > finish_wait(&memcg_oom_waitq, &owait.wait); > mem_cgroup_out_of_memory(memcg, mask, order); > + mem_cgroup_make_loan(memcg); > } else { > schedule(); > finish_wait(&memcg_oom_waitq, &owait.wait); > @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg, > if (!mem_cgroup_is_root(memcg)) { > unsigned long bytes = nr_pages * PAGE_SIZE; > + bytes = mem_cgroup_may_return_loan(memcg, bytes); > + > res_counter_uncharge(&memcg->res, bytes); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, bytes); > @@ -3989,6 +4050,7 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > { > struct memcg_batch_info *batch = NULL; > bool uncharge_memsw = true; > + unsigned long val; > /* If swapout, usage of swap doesn't decrease */ > if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) > @@ -4029,9 +4091,11 @@ static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, > batch->memsw_nr_pages++; > return; > direct_uncharge: > - res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE); > + val = nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(memcg, val); > + res_counter_uncharge(&memcg->res, val); > if (uncharge_memsw) > - res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); > + res_counter_uncharge(&memcg->memsw, val); > if (unlikely(batch->memcg != memcg)) > memcg_oom_recover(memcg); > } > @@ -4182,6 +4246,7 @@ void mem_cgroup_uncharge_start(void) > void mem_cgroup_uncharge_end(void) > { > struct memcg_batch_info *batch = ¤t->memcg_batch; > + unsigned long val; > if (!batch->do_batch) > return; > @@ -4192,16 +4257,16 @@ void mem_cgroup_uncharge_end(void) > if (!batch->memcg) > return; > + val = batch->nr_pages * PAGE_SIZE; > + val = mem_cgroup_may_return_loan(batch->memcg, val); > /* > * This "batch->memcg" is valid without any css_get/put etc... > * bacause we hide charges behind us. 
> */ > if (batch->nr_pages) > - res_counter_uncharge(&batch->memcg->res, > - batch->nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->res, val); > if (batch->memsw_nr_pages) > - res_counter_uncharge(&batch->memcg->memsw, > - batch->memsw_nr_pages * PAGE_SIZE); > + res_counter_uncharge(&batch->memcg->memsw, val); > memcg_oom_recover(batch->memcg); > /* forget this pointer (for sanity check) */ > batch->memcg = NULL; > @@ -6291,6 +6356,8 @@ mem_cgroup_css_alloc(struct cgroup *cont) > memcg->move_charge_at_immigrate = 0; > mutex_init(&memcg->thresholds_lock); > spin_lock_init(&memcg->move_lock); > + memcg->loan = 0; > + spin_lock_init(&memcg->loan_lock); > return &memcg->css; > -- > 1.7.10.2 > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 444+ messages in thread
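The "loan" logic quoted in the patch above can be illustrated in isolation. The following userspace sketch is a hypothetical simplification: the res_counter and the loan spinlock are replaced by plain integers, so only the arithmetic of mem_cgroup_make_loan() and mem_cgroup_may_return_loan() is shown.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the loan accounting; not kernel code. */
#define LOAN_MAX (2UL * 1024 * 1024)

struct memcg_sim {
	uint64_t usage;     /* charged bytes (stands in for res_counter) */
	unsigned long loan; /* bytes pre-uncharged at OOM time */
};

/* At OOM, uncharge up to LOAN_MAX (at most half the current usage) so
 * the OOM victim's exit path has headroom to make progress. */
static void make_loan(struct memcg_sim *m)
{
	unsigned long amount = LOAN_MAX;

	if (amount > m->usage / 2)
		amount = m->usage / 2;
	if (m->loan)            /* only one outstanding loan at a time */
		return;
	m->loan = amount;
	m->usage -= amount;
}

/* On uncharge, pay the loan back first; only the remainder is
 * actually subtracted from the counter by the caller. */
static unsigned long may_return_loan(struct memcg_sim *m, unsigned long val)
{
	unsigned long tmp = m->loan < val ? m->loan : val;

	m->loan -= tmp;
	return val - tmp;
}
```

For example, with 10MB charged, make_loan() moves 2MB (LOAN_MAX) out of the counter; a later 3MB uncharge first repays the 2MB loan and only 1MB reaches the counter.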
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2013-02-05 14:49 ` azurIt @ 2013-02-05 16:31 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2013-02-05 16:31 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Tue 05-02-13 15:49:47, azurIt wrote: [...] > I have another old problem which is maybe also related to this. I > wasn't connecting it with this before but now i'm not sure. Two of our > servers, which are affected by this cgroup problem, are also randomly > freezing completely (a few times per month). These are the symptoms: > - servers are answering to ping > - it is possible to connect via SSH but the connection freezes after > sending the password > - it is possible to log in via console but it freezes after typing > the login > These symptoms are very similar to HDD problems or HDD overload (but > there is no overload for sure). The only way to fix it is, probably, > hard rebooting the server (didn't find any other way). What do you > think? Can this be related? This is hard to tell without further information. > Maybe HDDs are locked in the similar way the cgroups are - we already > found out that cgroup freezing is related also to HDD activity. Maybe > there is a little chance that the whole HDD subsystem ends in > deadlock? "HDD subsystem", whatever that means, cannot be blocked by memcg being stuck. Certain accesses to some files might be an issue because those could have locks held, but I do not see other relations. I would start by checking the HW, trying to focus on reducing elements that could contribute - aka try to nail it down to the minimum set which reproduces the issue. I cannot help you much with that, I am afraid. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-18 15:20 ` Michal Hocko @ 2012-12-24 13:38 ` azurIt 0 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-12-24 13:38 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner >OK, good to hear and fingers crossed. I will try to get back to the >original problem and a better solution sometime early next year when >all the things settle a bit. Btw, i noticed one more thing when the problem is happening (=when any cgroup is stuck), i forgot to mention it before, sorry :( . It's related to HDDs, something is slowing them down in a strange way. All services are working normally and i really cannot notice any slowness, the only thing which i noticed is affected is our backup software ( www.Bacula.org ). When the problem occurs at night, so it's happening when backup is running, the backup is extremely slow and usually doesn't finish until i kill processes inside the affected cgroup (=until i resolve the problem). The backup software is NOT doing big HDD bandwidth BUT it's doing quite a huge number of disk operations (it needs to stat every file and directory). I believe that only the speed of disk operations is affected and is very slow. Merry Christmas! ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked 2012-12-24 13:38 ` azurIt @ 2012-12-28 16:35 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-12-28 16:35 UTC (permalink / raw) To: azurIt Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki, Johannes Weiner On Mon 24-12-12 14:38:50, azurIt wrote: > >OK, good to hear and fingers crossed. I will try to get back to the > >original problem and a better solution sometime early next year when > >all the things settle a bit. > > > Btw, i noticed one more thing when the problem is happening (=when any > cgroup is stuck), i forgot to mention it before, sorry :( . It's > related to HDDs, something is slowing them down in a strange way. All > services are working normally and i really cannot notice any slowness, > the only thing which i noticed is affected is our backup software ( > www.Bacula.org ). When the problem occurs at night, so it's happening when > backup is running, the backup is extremely slow and usually doesn't finish > until i kill processes inside the affected cgroup (=until i resolve the > problem). The backup software is NOT doing big HDD bandwidth BUT it's > doing quite a huge number of disk operations (it needs to stat every > file and directory). I believe that only the speed of disk operations > is affected and is very slow. I would bet that this is caused by the blocked processes in the memcg oom handler, which hold i_mutex, while the backup process wants to access the same inode with an operation which requires the lock. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:18 ` Michal Hocko @ 2012-11-26 17:46 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-26 17:46 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > [CCing also Johannes - the thread started here: > https://lkml.org/lkml/2012/11/21/497] > > On Mon 26-11-12 01:38:55, azurIt wrote: > > >This is hackish but it should help you in this case. Kamezawa, what do > > >you think about that? Should we generalize this and prepare something > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > >automatically and use the function whenever we are in a locked context? > > >To be honest I do not like this very much but nothing more sensible > > >(without touching non-memcg paths) comes to my mind. > > > > > > I installed kernel with this patch, will report back if problem occurs > > again OR in few weeks if everything will be ok. Thank you! > > Now that I am looking at the patch closer it will not work because it > depends on another patch which is not merged yet and even that one would > not help on its own because __GFP_NORETRY doesn't break the charge loop. > Sorry I have missed that... > > The patch below should help though. (it is based on top of the current > -mm tree but I will send a backport to 3.2 in the reply as well) > --- > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents another task from > terminating because it is blocked on the very same lock. 
> This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > Process B > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff So process B manages to lock the hierarchy, calls mem_cgroup_out_of_memory() and retries the charge infinitely, waiting for task A to die. All while it holds the i_mutex, preventing task A from dying, right? I think global oom already handles this in a much better way: invoke the OOM killer, sleep for a second, then return to userspace to relinquish all kernel resources and locks. 
The only reason we can't simply move away from the endless retry loop is that we don't want to return VM_FAULT_OOM and invoke the global OOM killer. But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and just restart the page fault (and return -ENOMEM to the buffered IO syscalls, respectively). This way, the memcg OOM killer is invoked as it should be, but nobody gets stuck anywhere livelocking with the exiting task. ^ permalink raw reply [flat|nested] 444+ messages in thread
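The livelock described here, and the effect of the proposed fix, can be modeled with a toy round-based scheduler. This sketch is purely illustrative (not kernel code): task B holds i_mutex and needs a charge that can only succeed once the OOM victim A has exited, while A needs i_mutex to exit. Under the endless-retry policy B spins forever holding the lock; under the unwind-and-restart policy B drops the lock, A exits, and B's retry succeeds.

```c
#include <stdbool.h>

/* Toy model of the memcg OOM livelock; all names are hypothetical. */
struct world {
	bool lock_held_by_b; /* i_mutex */
	bool a_exited;       /* OOM victim has died and freed memory */
};

/* One scheduling round. Returns true once B's charge can succeed. */
static bool step(struct world *w, bool restart_policy)
{
	/* Task A (the OOM victim) can only exit once it gets the lock. */
	if (!w->lock_held_by_b)
		w->a_exited = true;

	/* Task B's charge succeeds only after A has freed its memory. */
	if (w->a_exited)
		return true;
	if (restart_policy)
		w->lock_held_by_b = false; /* unwind, drop i_mutex, restart */
	/* endless-retry policy: keep the lock and spin */
	return false;
}

/* Run rounds until progress; -1 means livelocked. */
static int rounds_until_progress(bool restart_policy, int max_rounds)
{
	struct world w = { .lock_held_by_b = true, .a_exited = false };
	for (int i = 1; i <= max_rounds; i++)
		if (step(&w, restart_policy))
			return i;
	return -1;
}
```

With restart the system makes progress in two rounds; without it, no bound on the number of rounds helps.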
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 17:46 ` Johannes Weiner @ 2012-11-26 18:04 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-26 18:04 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > [CCing also Johannes - the thread started here: > > https://lkml.org/lkml/2012/11/21/497] > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > >you think about that? Should we generalize this and prepare something > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > >automatically and use the function whenever we are in a locked context? > > > >To be honest I do not like this very much but nothing more sensible > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > again OR in few weeks if everything will be ok. Thank you! > > > > Now that I am looking at the patch closer it will not work because it > > depends on another patch which is not merged yet and even that one would > > not help on its own because __GFP_NORETRY doesn't break the charge loop. > > Sorry I have missed that... > > > > The patch below should help though. 
(it is based on top of the current > > -mm tree but I will send a backport to 3.2 in the reply as well) > > --- > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.cz> > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > > > Process A > > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > > [<ffffffff81121c90>] do_last+0x250/0xa30 > > [<ffffffff81122547>] path_openat+0xd7/0x440 > > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > > [<ffffffff8110f950>] sys_open+0x20/0x30 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > Process B > > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [<ffffffff8111156a>] do_sync_write+0xea/0x130 
> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > > [<ffffffff81112381>] sys_write+0x51/0x90 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > So process B manages to lock the hierarchy, calls > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > for task A to die. All while it holds the i_mutex, preventing task A > from dying, right? Right. > I think global oom already handles this in a much better way: invoke > the OOM killer, sleep for a second, then return to userspace to > relinquish all kernel resources and locks. The only reason why we > can't simply change from an endless retry loop is because we don't > want to return VM_FAULT_OOM and invoke the global OOM killer. Exactly. > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > respectively. This way, the memcg OOM killer is invoked as it should > but nobody gets stuck anywhere livelocking with the exiting task. Hmm, we would still have a problem with oom disabled (aka user space OOM killer), right? All processes but those in mem_cgroup_handle_oom are at risk of being killed. Another point of view might be: why should we trigger an OOM killer from those paths in the first place? Write or read (or even readahead) are all calls that should rather fail than cause an OOM killer, in my opinion. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
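Michal's alternative (let the read/write/readahead charge paths fail with -ENOMEM instead of entering the memcg OOM path at all) could look roughly like the following sketch. The flag name and helpers here are hypothetical, not the actual kernel API; the point is only the control flow: charges tagged as coming from a syscall path short-circuit to -ENOMEM.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative only; not the real memcg charge API. */
#define GFP_MEMCG_NO_OOM 0x1 /* hypothetical flag */

/* Common charge path: over the limit, either fail fast or fall into
 * the OOM handler depending on the caller's flag. */
static int charge_common(unsigned int flags, bool over_limit)
{
	if (!over_limit)
		return 0;
	if (flags & GFP_MEMCG_NO_OOM)
		return -ENOMEM; /* caller fails the syscall instead */
	/* ...would enter mem_cgroup_handle_oom() here... */
	return -EAGAIN; /* stand-in for the OOM/retry path */
}

/* A pagecache charge issued from a write/read path would pass the
 * no-OOM flag, so it can never wedge the OOM machinery. */
static int cache_charge(bool over_limit)
{
	return charge_common(GFP_MEMCG_NO_OOM, over_limit);
}
```

The write syscall then simply returns -ENOMEM to userspace when the group is at its hard limit, and no task ever retries a charge while holding i_mutex.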
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked @ 2012-11-26 18:04 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-26 18:04 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > [CCing also Johannes - the thread started here: > > https://lkml.org/lkml/2012/11/21/497] > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > >you think about that? Should we generalize this and prepare something > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > >automatically and use the function whenever we are in a locked context? > > > >To be honest I do not like this very much but nothing more sensible > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > again OR in few weeks if everything will be ok. Thank you! > > > > Now that I am looking at the patch closer it will not work because it > > depends on other patch which is not merged yet and even that one would > > help on its own because __GFP_NORETRY doesn't break the charge loop. > > Sorry I have missed that... > > > > The patch bellow should help though. 
(it is based on top of the current > > -mm tree but I will send a backport to 3.2 in the reply as well) > > --- > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.cz> > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > > > Process A > > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > > [<ffffffff81121c90>] do_last+0x250/0xa30 > > [<ffffffff81122547>] path_openat+0xd7/0x440 > > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > > [<ffffffff8110f950>] sys_open+0x20/0x30 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > Process B > > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > [<ffffffff8111156a>] do_sync_write+0xea/0x130 
> > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > > [<ffffffff81112381>] sys_write+0x51/0x90 > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > [<ffffffffffffffff>] 0xffffffffffffffff > > So process B manages to lock the hierarchy, calls > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > for task A to die. All while it holds the i_mutex, preventing task A > from dying, right? Right. > I think global oom already handles this in a much better way: invoke > the OOM killer, sleep for a second, then return to userspace to > relinquish all kernel resources and locks. The only reason why we > can't simply change from an endless retry loop is because we don't > want to return VM_FAULT_OOM and invoke the global OOM killer. Exactly. > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > respectively. This way, the memcg OOM killer is invoked as it should > but nobody gets stuck anywhere livelocking with the exiting task. Hmm, we would still have a problem with oom disabled (aka user space OOM killer), right? All processes except those in mem_cgroup_handle_oom risk being killed. Another POV might be: why should we trigger an OOM killer from those paths in the first place? Write or read (or even readahead) are all calls that should rather fail than cause an OOM killer in my opinion. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 18:04 ` Michal Hocko (?) @ 2012-11-26 18:24 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-26 18:24 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote: > > > [CCing also Johannes - the thread started here: > > > https://lkml.org/lkml/2012/11/21/497] > > > > > > On Mon 26-11-12 01:38:55, azurIt wrote: > > > > >This is hackish but it should help you in this case. Kamezawa, what do > > > > >you think about that? Should we generalize this and prepare something > > > > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > > > > >automatically and use the function whenever we are in a locked context? > > > > >To be honest I do not like this very much but nothing more sensible > > > > >(without touching non-memcg paths) comes to my mind. > > > > > > > > > > > > I installed kernel with this patch, will report back if problem occurs > > > > again OR in few weeks if everything will be ok. Thank you! > > > > > > Now that I am looking at the patch closer it will not work because it > > > depends on another patch which is not merged yet, and even that one wouldn't > > > help on its own because __GFP_NORETRY doesn't break the charge loop. > > > Sorry I have missed that... > > > > > > The patch below should help though. 
(it is based on top of the current > > > -mm tree but I will send a backport to 3.2 in the reply as well) > > > --- > > > >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko <mhocko@suse.cz> > > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > > > memcg oom killer might deadlock if the process which falls down to > > > mem_cgroup_handle_oom holds a lock which prevents other task to > > > terminate because it is blocked on the very same lock. > > > This can happen when a write system call needs to allocate a page but > > > the allocation hits the memcg hard limit and there is nothing to reclaim > > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > > have been reclaimed already) and the process selected by memcg OOM > > > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > > > > > Process A > > > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > > > [<ffffffff81121c90>] do_last+0x250/0xa30 > > > [<ffffffff81122547>] path_openat+0xd7/0x440 > > > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > > > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > > > [<ffffffff8110f950>] sys_open+0x20/0x30 > > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > > > Process B > > > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > > > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > > > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > > > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > > > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > > > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > > > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > > > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > > > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > > > [<ffffffff810cb646>] 
generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > > > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > > > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > > > [<ffffffff81112381>] sys_write+0x51/0x90 > > > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > So process B manages to lock the hierarchy, calls > > mem_cgroup_out_of_memory() and retries the charge infinitely, waiting > > for task A to die. All while it holds the i_mutex, preventing task A > > from dying, right? > > Right. > > > I think global oom already handles this in a much better way: invoke > > the OOM killer, sleep for a second, then return to userspace to > > relinquish all kernel resources and locks. The only reason why we > > can't simply change from an endless retry loop is because we don't > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > Exactly. > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > respectively. This way, the memcg OOM killer is invoked as it should > > but nobody gets stuck anywhere livelocking with the exiting task. > > Hmm, we would still have a problem with oom disabled (aka user space OOM > killer), right? All processes but those in mem_cgroup_handle_oom are > risky to be killed. Could we still let everybody get stuck in there when the OOM killer is disabled and let userspace take care of it? > Other POV might be, why we should trigger an OOM killer from those paths > in the first place. Write or read (or even readahead) are all calls that > should rather fail than cause an OOM killer in my opinion. Readahead is arguable, but we kill globally for read() and write() and I think we should do the same for memcg. 
The OOM killer is there to resolve a problem that comes from overcommitting the machine, but the overuse does not have to come from the application that pushes the machine over the edge; that's why we don't just kill the allocating task but actually go look for the best candidate. If you have one memory hog that overuses the resources, attempted memory consumption in a different program should invoke the OOM killer. It does not matter whether this is a page fault (would still happen with your patch) or a buffered read/write (would no longer happen). ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 18:24 ` Johannes Weiner (?) @ 2012-11-26 19:03 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-26 19:03 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: [...] > > > I think global oom already handles this in a much better way: invoke > > > the OOM killer, sleep for a second, then return to userspace to > > > relinquish all kernel resources and locks. The only reason why we > > > can't simply change from an endless retry loop is because we don't > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > Exactly. > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > respectively. This way, the memcg OOM killer is invoked as it should > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > killer), right? All processes but those in mem_cgroup_handle_oom are > > risky to be killed. > > Could we still let everybody get stuck in there when the OOM killer is > disabled and let userspace take care of it? I am not sure what exactly you mean by "userspace take care of it" but if those processes are stuck and holding the lock then it is usually hard to find that out. Well if somebody is familiar with internal then it is doable but this makes the interface really unusable for regular usage. > > Other POV might be, why we should trigger an OOM killer from those paths > > in the first place. 
Write or read (or even readahead) are all calls that > > should rather fail than cause an OOM killer in my opinion. > > Readahead is arguable, but we kill globally for read() and write() and > I think we should do the same for memcg. Fair point, but the global case is a little bit easier than memcg here, because nobody can hook into the OOM killer and provide a userspace implementation for it, which is one of the cooler features of memcg... I am open to any suggestions, but we should somehow fix this (and backport it to stable trees, as it has been there for quite some time; the current report shows that the problem is not that hard to trigger). > The OOM killer is there to resolve a problem that comes from > overcommitting the machine but the overuse does not have to be from > the application that pushes the machine over the edge, that's why we > don't just kill the allocating task but actually go look for the best > candidate. If you have one memory hog that overuses the resources, > attempted memory consumption in a different program should invoke the > OOM killer. > It does not matter if this is a page fault (would still happen with > your patch) or a buffered read/write (would no longer happen). True, and it is sad that mmap then behaves slightly differently from read/write, which I should have mentioned in the changelog. As I said, I am open to other suggestions. Thanks -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
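For reference, the userspace OOM handling discussed here is the cgroup-v1 memory.oom_control interface. A rough sketch of how a handler is wired up follows; the paths assume the v1 memory controller is mounted at /sys/fs/cgroup/memory with a group named g1, so adjust for your setup:

```shell
G=/sys/fs/cgroup/memory/g1

# Disable the in-kernel OOM killer for this group: tasks that hit the
# hard limit now sit blocked in mem_cgroup_handle_oom until memory is
# freed -- which is exactly the state a deadlocked handler cannot untangle.
echo 1 > "$G/memory.oom_control"

# The handler is notified via an eventfd registered through
# cgroup.event_control ("<eventfd> <fd of memory.oom_control>"); once
# woken it must free memory or raise the limit itself, e.g.:
echo $((512 * 1024 * 1024)) > "$G/memory.limit_in_bytes"

# under_oom flips to 1 while tasks are blocked in the OOM path:
cat "$G/memory.oom_control"
```

This is a configuration sketch: it only works on a system with the v1 memory controller mounted and oom_kill_disable support, and the handler process itself must live outside the limited group.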
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 19:03 ` Michal Hocko (?) @ 2012-11-26 19:29 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-26 19:29 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > [...] > > > > I think global oom already handles this in a much better way: invoke > > > > the OOM killer, sleep for a second, then return to userspace to > > > > relinquish all kernel resources and locks. The only reason why we > > > > can't simply change from an endless retry loop is because we don't > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > Exactly. > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > risky to be killed. > > > > Could we still let everybody get stuck in there when the OOM killer is > > disabled and let userspace take care of it? > > I am not sure what exactly you mean by "userspace take care of it" but > if those processes are stuck and holding the lock then it is usually > hard to find that out. Well if somebody is familiar with internal then > it is doable but this makes the interface really unusable for regular > usage. 
If oom_kill_disable is set, then all processes get stuck all the way down in the charge stack. Whatever resource they pin, you may deadlock on if you try to touch it while handling the problem from userspace. I don't see how this is a new problem...? Or do you mean something else? > > > Other POV might be, why we should trigger an OOM killer from those paths > > > in the first place. Write or read (or even readahead) are all calls that > > > should rather fail than cause an OOM killer in my opinion. > > > > Readahead is arguable, but we kill globally for read() and write() and > > I think we should do the same for memcg. > > Fair point but the global case is little bit easier than memcg in this > case because nobody can hook on OOM killer and provide a userspace > implementation for it which is one of the cooler feature of memcg... > I am all open to any suggestions but we should somehow fix this (and > backport it to stable trees as this is there for quite some time. The > current report shows that the problem is not that hard to trigger). As per above, the userspace OOM handling is risky as hell anyway. What happens when an anonymous fault waits in memcg userspace OOM while holding the mmap_sem, and a writer lines up behind it? Your userspace OOM handler had better not look at any of the /proc files of the stuck task that require the mmap_sem. By the same token, it probably shouldn't touch the same files a memcg task is stuck trying to read/write. ^ permalink raw reply [flat|nested] 444+ messages in thread
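The idea proposed above — invoke the memcg OOM killer once, return a new VM_FAULT_OOM_HANDLED code, and simply restart the page fault with all kernel locks dropped — can be sketched as a small user-space simulation. Everything below is illustrative: the flag values, the helper names, and the "one kill frees one charge" model are assumptions made for the sketch, not the kernel's actual definitions.

```c
#include <stdbool.h>

/* Hypothetical fault return codes, modeled loosely on the kernel's
 * VM_FAULT_* flags. VM_FAULT_OOM_HANDLED is the new code proposed in
 * this thread; the names and values here are illustrative only. */
#define VM_FAULT_OOM         0x0001  /* would escalate to the global OOM killer */
#define VM_FAULT_OOM_HANDLED 0x0002  /* memcg OOM killer was already invoked    */

/* Toy charge state: charging fails until an OOM kill "frees" memory. */
static int charges_left;

static int mem_cgroup_charge(void)
{
    if (charges_left > 0) {
        charges_left--;
        return 0;                    /* charge succeeded */
    }
    /* Invoke the memcg OOM killer once and report back, instead of
     * looping inside the charge path with locks held. */
    charges_left = 1;                /* pretend the kill freed one charge */
    return VM_FAULT_OOM_HANDLED;
}

/* Fault dispatch: on OOM_HANDLED, restart the fault from the top (all
 * locks having been dropped in between) rather than invoking global OOM. */
static bool handle_fault(void)
{
    for (;;) {
        int ret = mem_cgroup_charge();
        if (ret == 0)
            return true;             /* fault completed */
        if (ret == VM_FAULT_OOM_HANDLED)
            continue;                /* retry the whole fault */
        return false;                /* VM_FAULT_OOM: give up */
    }
}
```

The point of the restart is visible in the structure: the livelock-prone wait happens at the outermost level, where no kernel resources are pinned, rather than deep inside the charge stack.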
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 19:29 ` Johannes Weiner (?) @ 2012-11-26 20:08 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-26 20:08 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 26-11-12 14:29:41, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > [...] > > > > > I think global oom already handles this in a much better way: invoke > > > > > the OOM killer, sleep for a second, then return to userspace to > > > > > relinquish all kernel resources and locks. The only reason why we > > > > > can't simply change from an endless retry loop is because we don't > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > > > Exactly. > > > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > > risky to be killed. > > > > > > Could we still let everybody get stuck in there when the OOM killer is > > > disabled and let userspace take care of it? > > > > I am not sure what exactly you mean by "userspace take care of it" but > > if those processes are stuck and holding the lock then it is usually > > hard to find that out. 
Well if somebody is familiar with internal then > > it is doable but this makes the interface really unusable for regular > > usage. > > If oom_kill_disable is set, then all processes get stuck all the way > down in the charge stack. Whatever resource they pin, you may > deadlock on if you try to touch it while handling the problem from > userspace. OK, I guess I am getting what you are trying to say. So what you are suggesting is to just let mem_cgroup_out_of_memory send the signal and move on without retrying (or with a few charge retries without further OOM killing), and fail the charge with your new FAULT_OOM_HANDLED (resp. something like FAULT_RETRY) error code, resp. ENOMEM, depending on the caller. The OOM-disabled case would be "you are on your own", because that has been dangerous anyway. Correct? I do agree that the current endless retry loop is far from ideal and I can see some updates, but I am quite nervous about any potential regressions in this area (e.g. too aggressive OOM, etc.). I have to think about it some more. Anyway, if you have some more specific ideas I would be happy to review patches. [...] Thanks -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
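The bounded variant summarized here — send the SIGKILL via mem_cgroup_out_of_memory, retry the charge a few times without further killing, then fail with -ENOMEM — can be modeled with a self-contained toy. The retry count, the usage/limit counters, and the victim_exits() helper are all hypothetical stand-ins for the real memcg accounting, not kernel code.

```c
#include <errno.h>

#define MEMCG_OOM_RETRIES 2  /* illustrative bound, not a kernel constant */

/* Toy stand-ins for the memcg internals (names are hypothetical). */
static int usage, limit = 2;
static int victim_pending;   /* set once the OOM killer has signalled a task */

static int try_charge_once(void)
{
    if (usage < limit) {
        usage++;
        return 0;
    }
    return -ENOMEM;          /* over the cgroup limit */
}

static void mem_cgroup_out_of_memory(void)
{
    victim_pending = 1;      /* real code picks a victim, sends SIGKILL, returns */
}

static void victim_exits(void)
{
    if (victim_pending) {
        usage--;             /* the exiting task uncharges its memory */
        victim_pending = 0;
    }
}

/* Bounded charge: kill once, retry a few times while the victim exits,
 * then give up with -ENOMEM instead of looping forever inside the
 * charge path with locks held. */
static int mem_cgroup_charge_bounded(void)
{
    int retries = MEMCG_OOM_RETRIES;

    if (try_charge_once() == 0)
        return 0;

    mem_cgroup_out_of_memory();

    while (retries--) {
        victim_exits();      /* in reality: a brief wait for the victim */
        if (try_charge_once() == 0)
            return 0;
    }
    return -ENOMEM;          /* caller decides: restart the fault or fail the I/O */
}
```

The caller-dependent translation of that -ENOMEM (restart the page fault vs. fail a buffered read/write) is exactly the split discussed in this exchange.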
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 20:08 ` Michal Hocko @ 2012-11-26 20:19 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-26 20:19 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote: > On Mon 26-11-12 14:29:41, Johannes Weiner wrote: > > On Mon, Nov 26, 2012 at 08:03:29PM +0100, Michal Hocko wrote: > > > On Mon 26-11-12 13:24:21, Johannes Weiner wrote: > > > > On Mon, Nov 26, 2012 at 07:04:44PM +0100, Michal Hocko wrote: > > > > > On Mon 26-11-12 12:46:22, Johannes Weiner wrote: > > > [...] > > > > > > I think global oom already handles this in a much better way: invoke > > > > > > the OOM killer, sleep for a second, then return to userspace to > > > > > > relinquish all kernel resources and locks. The only reason why we > > > > > > can't simply change from an endless retry loop is because we don't > > > > > > want to return VM_FAULT_OOM and invoke the global OOM killer. > > > > > > > > > > Exactly. > > > > > > > > > > > But maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and > > > > > > just restart the pagefault. Return -ENOMEM to the buffered IO syscall > > > > > > respectively. This way, the memcg OOM killer is invoked as it should > > > > > > but nobody gets stuck anywhere livelocking with the exiting task. > > > > > > > > > > Hmm, we would still have a problem with oom disabled (aka user space OOM > > > > > killer), right? All processes but those in mem_cgroup_handle_oom are > > > > > risky to be killed. > > > > > > > > Could we still let everybody get stuck in there when the OOM killer is > > > > disabled and let userspace take care of it? > > > > > > I am not sure what exactly you mean by "userspace take care of it" but > > > if those processes are stuck and holding the lock then it is usually > > > hard to find that out. 
Well if somebody is familiar with internal then > > > it is doable but this makes the interface really unusable for regular > > > usage. > > > > If oom_kill_disable is set, then all processes get stuck all the way > > down in the charge stack. Whatever resource they pin, you may > > deadlock on if you try to touch it while handling the problem from > > userspace. > > OK, I guess I am getting what you are trying to say. So what you are > suggesting is to just let mem_cgroup_out_of_memory send the signal and > move on without retry (or with few charge retries without further OOM > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. > something like FAULT_RETRY) error code resp. ENOMEM depending on the > caller. OOM disabled case would be "you are on your own" because this > has been dangerous anyway. Correct? Yes. > I do agree that the current endless retry loop is far from being ideal > and can see some updates but I am quite nervous about any potential > regressions in this area (e.g. too aggressive OOM etc...). I have to > think about it some more. Agreed on all points. Maybe we can keep a couple of the oom retry iterations or something like that, which is still much more than what global does and I don't think the global OOM killer is overly eager. Testing will show more. > Anyway if you have some more specific ideas I would be happy to review > patches. Okay, I just wanted to check back with you before going down this path. What are we going to do short term, though? Do you want to push the disable-oom-for-pagecache for now or should we put the VM_FAULT_OOM_HANDLED fix in the next version and do stable backports? This issue has been around for a while so frankly I don't think it's urgent enough to rush things. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 20:19 ` Johannes Weiner (?) @ 2012-11-26 20:46 ` azurIt -1 siblings, 0 replies; 444+ messages in thread From: azurIt @ 2012-11-26 20:46 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko Cc: linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki >This issue has been around for a while so frankly I don't think it's >urgent enough to rush things. Well, it's quite urgent at least for us :( i haven't reported this so far cos i wasn't sure it's a kernel thing. I will be really happy and thankful if a fix for this can go to 3.2 in the near future. Thank you very much! azur ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 20:46 ` azurIt @ 2012-11-26 20:53 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-26 20:53 UTC (permalink / raw) To: azurIt Cc: Michal Hocko, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon, Nov 26, 2012 at 09:46:38PM +0100, azurIt wrote: > >This issue has been around for a while so frankly I don't think it's > >urgent enough to rush things. > > > Well, it's quite urgent at least for us :( i wasn't reported this so > far cos i wasn't sure it's a kernel thing. I will be really happy > and thankfull if fix for this can go to 3.2 in some near > future.. Thank you very much! I understand and of course it's important that we get it fixed as soon as possible. All I meant was that this problem has not exactly been introduced in 3.7 and the fix is non-trivial so we should not be rushing a change like this into 3.7 just days before its release. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 20:19 ` Johannes Weiner (?) @ 2012-11-26 22:06 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-26 22:06 UTC (permalink / raw) To: Johannes Weiner Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, KAMEZAWA Hiroyuki On Mon 26-11-12 15:19:18, Johannes Weiner wrote: > On Mon, Nov 26, 2012 at 09:08:48PM +0100, Michal Hocko wrote: [...] > > OK, I guess I am getting what you are trying to say. So what you are > > suggesting is to just let mem_cgroup_out_of_memory send the signal and > > move on without retry (or with few charge retries without further OOM > > killing) and fail the charge with your new FAULT_OOM_HANDLED (resp. > > something like FAULT_RETRY) error code resp. ENOMEM depending on the > > caller. OOM disabled case would be "you are on your own" because this > > has been dangerous anyway. Correct? > > Yes. > > > I do agree that the current endless retry loop is far from being ideal > > and can see some updates but I am quite nervous about any potential > > regressions in this area (e.g. too aggressive OOM etc...). I have to > > think about it some more. > > Agreed on all points. Maybe we can keep a couple of the oom retry > iterations or something like that, which is still much more than what > global does and I don't think the global OOM killer is overly eager. Yes, we can offer less blood and more comfort. > > Testing will show more. > > > Anyway if you have some more specific ideas I would be happy to review > > patches. > > Okay, I just wanted to check back with you before going down this > path. What are we going to do short term, though? Do you want to > push the disable-oom-for-pagecache for now or should we put the > VM_FAULT_OOM_HANDLED fix in the next version and do stable backports? > > This issue has been around for a while so frankly I don't think it's > urgent enough to rush things.
Yes, but now we have a real usecase where this hurts AFAIU. Unless we come up with a fix/reasonable workaround I would rather go with something simpler for a start and more sophisticated later. I have to double check other places where we do charging but the last time I've checked we don't hold page locks on already visible pages (we do precharge in __do_fault f.e.), mmap_sem for reading in the page fault path is also safe (with oom enabled) and I guess that tmpfs is ok as well. Then we have the page cache and that one should be covered by my patch. So we should be covered. But I like your idea long term. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-26 13:18 ` Michal Hocko (?) @ 2012-11-27 0:05 ` Kamezawa Hiroyuki -1 siblings, 0 replies; 444+ messages in thread From: Kamezawa Hiroyuki @ 2012-11-27 0:05 UTC (permalink / raw) To: Michal Hocko Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner (2012/11/26 22:18), Michal Hocko wrote: > [CCing also Johannes - the thread started here: > https://lkml.org/lkml/2012/11/21/497] > > On Mon 26-11-12 01:38:55, azurIt wrote: >>> This is hackish but it should help you in this case. Kamezawa, what do >>> you think about that? Should we generalize this and prepare something >>> like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY >>> automatically and use the function whenever we are in a locked context? >>> To be honest I do not like this very much but nothing more sensible >>> (without touching non-memcg paths) comes to my mind. >> >> >> I installed kernel with this patch, will report back if problem occurs >> again OR in few weeks if everything will be ok. Thank you! > > Now that I am looking at the patch closer it will not work because it > depends on another patch which is not merged yet and even that one wouldn't > help on its own because __GFP_NORETRY doesn't break the charge loop. > Sorry I have missed that... > > The patch below should help though. (it is based on top of the current > -mm tree but I will send a backport to 3.2 in the reply as well) > --- > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents another task from > terminating because it is blocked on the very same lock. 
> This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > Process B > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > This is not a hard deadlock though because the administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_to_page_cache_locked). 
A mem_cgroup_cache_charge_no_oom helper > function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > then tells mem_cgroup_charge_common that OOM is not allowed for the > charge. No OOM from this path, apart from fixing the bug, also makes some > sense as we really do not want to cause an OOM because of a page cache > usage. > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable to the OOM killer IMO. > > __GFP_NORETRY is abused for this memcg specific flag because it has been > used to prevent from OOM already (since not-merged-yet "memcg: reclaim > when more than one page needed"). The only difference is that the flag > doesn't prevent from reclaim anymore which kind of makes sense because > the global memory allocator triggers reclaim as well. The retry without > any reclaim on __GFP_NORETRY doesn't make much sense anyway because this > is effectively a busy loop with allowed OOM in this path. > > Reported-by: azurIt <azurit@pobox.sk> > Signed-off-by: Michal Hocko <mhocko@suse.cz> As a short term fix, I think this patch will work enough and seems simple enough. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reading the discussion between you and Johannes, to release locks, I understand the memcg needs to return "RETRY" for a long term fix. Thinking a little, it will be simple to return "RETRY" to all processes waiting on the oom kill queue of a memcg and it can be done by a small fix to memory.c. Thank you. 
-Kame > --- > include/linux/gfp.h | 3 +++ > include/linux/memcontrol.h | 12 ++++++++++++ > mm/filemap.c | 8 +++++++- > mm/memcontrol.c | 5 +---- > 4 files changed, 23 insertions(+), 5 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 10e667f..aac9b21 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -152,6 +152,9 @@ struct vm_area_struct; > /* 4GB DMA on some platforms */ > #define GFP_DMA32 __GFP_DMA32 > > +/* memcg oom killer is not allowed */ > +#define GFP_MEMCG_NO_OOM __GFP_NORETRY > + > /* Convert GFP flags to their corresponding migrate type */ > static inline int allocflags_to_migratetype(gfp_t gfp_flags) > { > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..1ad4bc6 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -65,6 +65,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask); > > +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, > + struct mm_struct *mm, gfp_t gfp_mask) > +{ > + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); > +} > + > struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); > > @@ -215,6 +221,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, > return 0; > } > > +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, > + struct mm_struct *mm, gfp_t gfp_mask) > +{ > + return 0; > +} > + > static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) > { > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..ef14351 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > 
VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > - error = mem_cgroup_cache_charge(page, current->mm, > + /* > + * Cannot trigger OOM even if gfp_mask would allow that normally > + * because we might be called from a locked context and that > + * could lead to deadlocks if the killed process is waiting for > + * the same lock. > + */ > + error = mem_cgroup_cache_charge_no_oom(page, current->mm, > gfp_mask & GFP_RECLAIM_MASK); > if (error) > goto out; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 02ee2f7..b4754ba 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2430,9 +2430,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > if (!(gfp_mask & __GFP_WAIT)) > return CHARGE_WOULDBLOCK; > > - if (gfp_mask & __GFP_NORETRY) > - return CHARGE_NOMEM; > - > ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > return CHARGE_RETRY; > @@ -3713,7 +3710,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > { > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > - bool oom = true; > + bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM); > int ret; > > if (PageTransHuge(page)) { > ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-27 0:05 ` Kamezawa Hiroyuki (?) @ 2012-11-27 9:54 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-27 9:54 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist, Johannes Weiner On Tue 27-11-12 09:05:30, KAMEZAWA Hiroyuki wrote: [...] > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Thanks! If Johannes is also ok with this for now I will resubmit the patch to Andrew after I hear back from the reporter. > Reading discussion between you and Johannes, to release locks, I understand > the memcg need to return "RETRY" for a long term fix. Thinking a little, > it will be simple to return "RETRY" to all processes waited on oom kill queue > of a memcg and it can be done by a small fixes to memory.c. I wouldn't call it simple but it is doable. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-27 0:05 ` Kamezawa Hiroyuki (?) @ 2012-11-27 19:48 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-27 19:48 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Michal Hocko, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Tue, Nov 27, 2012 at 09:05:30AM +0900, Kamezawa Hiroyuki wrote: > (2012/11/26 22:18), Michal Hocko wrote: > >[CCing also Johannes - the thread started here: > >https://lkml.org/lkml/2012/11/21/497] > > > >On Mon 26-11-12 01:38:55, azurIt wrote: > >>>This is hackish but it should help you in this case. Kamezawa, what do > >>>you think about that? Should we generalize this and prepare something > >>>like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY > >>>automatically and use the function whenever we are in a locked context? > >>>To be honest I do not like this very much but nothing more sensible > >>>(without touching non-memcg paths) comes to my mind. > >> > >> > >>I installed kernel with this patch, will report back if problem occurs > >>again OR in few weeks if everything will be ok. Thank you! > > > >Now that I am looking at the patch closer it will not work because it > >depends on other patch which is not merged yet and even that one wouldn't > >help on its own because __GFP_NORETRY doesn't break the charge loop. > >Sorry I have missed that... > > > >The patch below should help though. 
(it is based on top of the current > >-mm tree but I will send a backport to 3.2 in the reply as well) > >--- > > From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001 > >From: Michal Hocko <mhocko@suse.cz> > >Date: Mon, 26 Nov 2012 11:47:57 +0100 > >Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > >memcg oom killer might deadlock if the process which falls down to > >mem_cgroup_handle_oom holds a lock which prevents other tasks from > >terminating because they are blocked on the very same lock. > >This can happen when a write system call needs to allocate a page but > >the allocation hits the memcg hard limit and there is nothing to reclaim > >(e.g. there is no swap or swap limit is hit as well and all cache pages > >have been reclaimed already) and the process selected by memcg OOM > >killer is blocked on i_mutex on the same inode (e.g. truncate it). > > > >Process A > >[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > >[<ffffffff81121c90>] do_last+0x250/0xa30 > >[<ffffffff81122547>] path_openat+0xd7/0x440 > >[<ffffffff811229c9>] do_filp_open+0x49/0xa0 > >[<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > >[<ffffffff8110f950>] sys_open+0x20/0x30 > >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >[<ffffffffffffffff>] 0xffffffffffffffff > > > >Process B > >[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > >[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > >[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > >[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > >[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > >[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > >[<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > >[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > >[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > >[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > >[<ffffffff8111156a>] do_sync_write+0xea/0x130 > >[<ffffffff81112183>] 
vfs_write+0xf3/0x1f0 > >[<ffffffff81112381>] sys_write+0x51/0x90 > >[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > >[<ffffffffffffffff>] 0xffffffffffffffff > > > >This is not a hard deadlock though because the administrator can still > >intervene and increase the limit on the group which helps the writer to > >finish the allocation and release the lock. > > > >This patch heals the problem by forbidding OOM from page cache charges > >(namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper > >function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which > >then tells mem_cgroup_charge_common that OOM is not allowed for the > >charge. No OOM from this path, except for fixing the bug, also makes some > >sense as we really do not want to cause an OOM because of page cache > >usage. > >As a possibly visible result add_to_page_cache_lru might fail more often > >with ENOMEM but this is to be expected if the limit is set and it is > >preferable to the OOM killer IMO. > > > >__GFP_NORETRY is abused for this memcg specific flag because it has been > >used to prevent from OOM already (since not-merged-yet "memcg: reclaim > >when more than one page needed"). The only difference is that the flag > >doesn't prevent from reclaim anymore which kind of makes sense because > >the global memory allocator triggers reclaim as well. The retry without > >any reclaim on __GFP_NORETRY doesn't make much sense anyway because this > >is effectively a busy loop with allowed OOM in this path. > > > >Reported-by: azurIt <azurit@pobox.sk> > >Signed-off-by: Michal Hocko <mhocko@suse.cz> > > As a short term fix, I think this patch will work enough and seems simple enough. > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Yes, let's do this for now. 
> >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > >index 10e667f..aac9b21 100644 > >--- a/include/linux/gfp.h > >+++ b/include/linux/gfp.h > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > /* 4GB DMA on some platforms */ > > #define GFP_DMA32 __GFP_DMA32 > > > >+/* memcg oom killer is not allowed */ > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY Could we leave this within memcg, please? An extra flag to mem_cgroup_cache_charge() or the like. GFP flags are about controlling the page allocator, this seems abusive. We have an oom flag down in try_charge, maybe just propagate this up the stack? > >diff --git a/mm/filemap.c b/mm/filemap.c > >index 83efee7..ef14351 100644 > >--- a/mm/filemap.c > >+++ b/mm/filemap.c > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > VM_BUG_ON(!PageLocked(page)); > > VM_BUG_ON(PageSwapBacked(page)); > > > >- error = mem_cgroup_cache_charge(page, current->mm, > >+ /* > >+ * Cannot trigger OOM even if gfp_mask would allow that normally > >+ * because we might be called from a locked context and that > >+ * could lead to deadlocks if the killed process is waiting for > >+ * the same lock. > >+ */ > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm, > > gfp_mask & GFP_RECLAIM_MASK); > > if (error) > > goto out; Shmem does not use this function but also charges under the i_mutex in the write path and fallocate at least. ^ permalink raw reply [flat|nested] 444+ messages in thread
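The behaviour the reviewers are converging on can be summarised as a small decision table: an over-limit charge that cannot make reclaim progress either parks the task in the memcg OOM handler or fails fast, and only the fail-fast variant is safe while holding i_mutex. A hedged outcome model (hypothetical names and enum, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome model of a memcg charge attempt; not kernel code. */
enum charge_result { CHARGE_OK, CHARGE_ENOMEM, CHARGE_BLOCKS_IN_OOM };

/*
 * Models the behaviour discussed in the thread: if the group is over
 * its limit and reclaim cannot free anything, an OOM-allowed charge
 * parks the task in mem_cgroup_handle_oom() (dangerous while holding
 * i_mutex), whereas an OOM-forbidden charge fails fast with -ENOMEM
 * and lets the caller unwind and release its locks.
 */
static enum charge_result charge(bool over_limit, bool reclaimable,
                                 bool oom_allowed)
{
    if (!over_limit || reclaimable)
        return CHARGE_OK;
    return oom_allowed ? CHARGE_BLOCKS_IN_OOM : CHARGE_ENOMEM;
}
```

This is also why Johannes's shmem observation matters: every caller that charges with a mutex held needs the oom_allowed=false row of this table, not just add_to_page_cache_locked().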
* [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-27 19:48 ` Johannes Weiner (?) @ 2012-11-27 20:54 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-27 20:54 UTC (permalink / raw) To: Johannes Weiner, KAMEZAWA Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist On Tue 27-11-12 14:48:13, Johannes Weiner wrote: [...] > > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > >index 10e667f..aac9b21 100644 > > >--- a/include/linux/gfp.h > > >+++ b/include/linux/gfp.h > > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > > /* 4GB DMA on some platforms */ > > > #define GFP_DMA32 __GFP_DMA32 > > > > > >+/* memcg oom killer is not allowed */ > > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY > > Could we leave this within memcg, please? An extra flag to > mem_cgroup_cache_charge() or the like. GFP flags are about > controlling the page allocator, this seems abusive. We have an oom > flag down in try_charge, maybe just propagate this up the stack? OK, what about the patch below? I have dropped Kame's Acked-by because it has been reworked. The patch is the same in principle. > > >diff --git a/mm/filemap.c b/mm/filemap.c > > >index 83efee7..ef14351 100644 > > >--- a/mm/filemap.c > > >+++ b/mm/filemap.c > > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > > VM_BUG_ON(!PageLocked(page)); > > > VM_BUG_ON(PageSwapBacked(page)); > > > > > >- error = mem_cgroup_cache_charge(page, current->mm, > > >+ /* > > >+ * Cannot trigger OOM even if gfp_mask would allow that normally > > >+ * because we might be called from a locked context and that > > >+ * could lead to deadlocks if the killed process is waiting for > > >+ * the same lock. 
> > >+ */ > > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm, > > > gfp_mask & GFP_RECLAIM_MASK); > > > if (error) > > > goto out; > > Shmem does not use this function but also charges under the i_mutex in > the write path and fallocate at least. Right you are --- >From 60cc8a184490d277eb24fca551b114f1e2234ce0 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other tasks from terminating because they are blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). 
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use oom parameter directly as per Johannes - handle also shmem write faults resp. 
fallocate properly as per Johannes Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 9 ++++----- mm/shmem.c | 14 +++++++++++--- 4 files changed, 25 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..26690d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..cef63b5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. 
*/ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,16 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1217,7 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, true); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked @ 2012-11-27 20:54 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-27 20:54 UTC (permalink / raw) To: Johannes Weiner, KAMEZAWA Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist On Tue 27-11-12 14:48:13, Johannes Weiner wrote: [...] > > >diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > >index 10e667f..aac9b21 100644 > > >--- a/include/linux/gfp.h > > >+++ b/include/linux/gfp.h > > >@@ -152,6 +152,9 @@ struct vm_area_struct; > > > /* 4GB DMA on some platforms */ > > > #define GFP_DMA32 __GFP_DMA32 > > > > > >+/* memcg oom killer is not allowed */ > > >+#define GFP_MEMCG_NO_OOM __GFP_NORETRY > > Could we leave this within memcg, please? An extra flag to > mem_cgroup_cache_charge() or the like. GFP flags are about > controlling the page allocator, this seems abusive. We have an oom > flag down in try_charge, maybe just propagate this up the stack? OK, what about the patch below? I have dropped Kame's Acked-by because it has been reworked. The patch is the same in principle. > > >diff --git a/mm/filemap.c b/mm/filemap.c > > >index 83efee7..ef14351 100644 > > >--- a/mm/filemap.c > > >+++ b/mm/filemap.c > > >@@ -447,7 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > > > VM_BUG_ON(!PageLocked(page)); > > > VM_BUG_ON(PageSwapBacked(page)); > > > > > >- error = mem_cgroup_cache_charge(page, current->mm, > > >+ /* > > >+ * Cannot trigger OOM even if gfp_mask would allow that normally > > >+ * because we might be called from a locked context and that > > >+ * could lead to deadlocks if the killed process is waiting for > > >+ * the same lock. 
> > >+ */ > > >+ error = mem_cgroup_cache_charge_no_oom(page, current->mm, > > > gfp_mask & GFP_RECLAIM_MASK); > > > if (error) > > > goto out; > > Shmem does not use this function but also charges under the i_mutex in > the write path and fallocate at least. Right you are --- From 60cc8a184490d277eb24fca551b114f1e2234ce0 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked The memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because that task is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap, or the swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it). 
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use the oom parameter directly as per Johannes - handle shmem write faults and 
fallocate properly as per Johannes Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 9 ++++----- mm/shmem.c | 14 +++++++++++--- 4 files changed, 25 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..26690d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..cef63b5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. 
*/ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,16 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1217,7 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, true); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-27 20:54 ` Michal Hocko @ 2012-11-27 20:59 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-27 20:59 UTC (permalink / raw) To: Johannes Weiner, KAMEZAWA Hiroyuki Cc: azurIt, linux-kernel, linux-mm, cgroups mailinglist Sorry, forgot about one shmem charge: --- From 7ae29927d24471c1b1a6ceb021219c592c1ef518 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Tue, 27 Nov 2012 21:53:13 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked The memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating because that task is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap, or the swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it). 
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use the oom parameter directly as per Johannes - handle shmem write faults and 
fallocate properly as per Johannes Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 9 ++++----- mm/shmem.c | 15 ++++++++++++--- 4 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..26690d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3851,7 +3850,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..ba59cfa 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. 
*/ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,16 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1217,8 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp < SGP_WRITE); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-27 20:59 ` Michal Hocko @ 2012-11-28 15:26 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-28 15:26 UTC (permalink / raw) To: Michal Hocko Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > gfp_mask, &memcg); I think you need to pass it down the swapcache path too, as that is what happens when the shmem page written to is in swap and has been read into swapcache by the time of charging. > @@ -1152,8 +1152,16 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ Indentation broken? > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); The code tests for read-only paths a bunch of times using sgp != SGP_WRITE && sgp != SGP_FALLOC Would probably be more consistent and more robust to use this here as well? > @@ -1209,7 +1217,8 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); Same. Otherwise, the patch looks good to me, thanks for persisting :) ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked @ 2012-11-28 15:26 ` Johannes Weiner 0 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-28 15:26 UTC (permalink / raw) To: Michal Hocko Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > gfp_mask, &memcg); I think you need to pass it down the swapcache path too, as that is what happens when the shmem page written to is in swap and has been read into swapcache by the time of charging. > @@ -1152,8 +1152,16 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ Indentation broken? > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); The code tests for read-only paths a bunch of times using sgp != SGP_WRITE && sgp != SGP_FALLOC Would probably be more consistent and more robust to use this here as well? > @@ -1209,7 +1217,8 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp < SGP_WRITE); Same. Otherwise, the patch looks good to me, thanks for persisting :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-28 15:26 ` Johannes Weiner (?) @ 2012-11-28 16:04 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-28 16:04 UTC (permalink / raw) To: Johannes Weiner Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed 28-11-12 10:26:31, Johannes Weiner wrote: > On Tue, Nov 27, 2012 at 09:59:44PM +0100, Michal Hocko wrote: > > @@ -3863,7 +3862,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > > return 0; > > > > if (!PageSwapCache(page)) > > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > > else { /* page is swapcache/shmem */ > > ret = __mem_cgroup_try_charge_swapin(mm, page, > > gfp_mask, &memcg); > > I think you need to pass it down the swapcache path too, as that is > what happens when the shmem page written to is in swap and has been > read into swapcache by the time of charging. You are right, of course. I shouldn't send patches late in the evening after staring at a crashdump for a good part of the day. /me ashamed. > > @@ -1152,8 +1152,16 @@ repeat: > > goto failed; > > } > > > > + /* > > + * Cannot trigger OOM even if gfp_mask would allow that > > + * normally because we might be called from a locked > > + * context (i_mutex held) if this is a write lock or > > + * fallocate and that could lead to deadlocks if the > > + * killed process is waiting for the same lock. > > + */ > > Indentation broken? c&p > > error = mem_cgroup_cache_charge(page, current->mm, > > - gfp & GFP_RECLAIM_MASK); > > + gfp & GFP_RECLAIM_MASK, > > + sgp < SGP_WRITE); > > The code tests for read-only paths a bunch of times using > > sgp != SGP_WRITE && sgp != SGP_FALLOC > > Would probably be more consistent and more robust to use this here as > well? Yes, my laziness.
I was considering that but it was really long so I've chosen the simpler way. But you are right that consistency is probably better here. > > @@ -1209,7 +1217,8 @@ repeat: > > SetPageSwapBacked(page); > > __set_page_locked(page); > > error = mem_cgroup_cache_charge(page, current->mm, > > - gfp & GFP_RECLAIM_MASK); > > + gfp & GFP_RECLAIM_MASK, > > + sgp < SGP_WRITE); > > Same. > > Otherwise, the patch looks good to me, thanks for persisting :) Thanks for the thorough review. Here we go with the fixed version. --- From 5000bf32c9c02fcd31d18e615300d8e7e7ef94a5 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 28 Nov 2012 16:49:46 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other tasks from terminating because they are blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it).
Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though because the administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. Changes since v1 - do not abuse gfp_flags and rather use oom parameter directly as per Johannes - handle also shmem write faults resp.
fallocate properly as per Johannes Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 11 +++++++---- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 25 +++++++++++++------------ mm/memory.c | 2 +- mm/shmem.c | 17 ++++++++++++++--- mm/swapfile.c | 2 +- 6 files changed, 43 insertions(+), 23 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..5abe441 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); /* for swap handling */ extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, + bool oom); extern void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg); extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,13 +211,15 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) + struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, 
VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. + */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..02a6d70 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, - struct mem_cgroup **memcgp) + struct mem_cgroup **memcgp, + bool oom) { struct mem_cgroup *memcg; struct page_cgroup *pc; @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *memcgp = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); css_put(&memcg->css); if (ret == -EINTR) ret = 0; return ret; charge_cur_mm: - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; } int 
mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, - gfp_t gfp_mask, struct mem_cgroup **memcgp) + gfp_t gfp_mask, struct mem_cgroup **memcgp, + bool oom) { *memcgp = NULL; if (mem_cgroup_disabled()) @@ -3803,12 +3804,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, if (!PageSwapCache(page)) { int ret; - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, - gfp_mask, &memcg); + gfp_mask, &memcg, oom); if (!ret) __mem_cgroup_commit_charge_swapin(page, memcg, type); } diff --git a/mm/memory.c b/mm/memory.c index 6891d3b..afad903 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, } } - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { ret = VM_FAULT_OOM; goto out_page; } diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..3b27db4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t 
swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. */ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,17 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. + */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1218,9 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); diff --git a/mm/swapfile.c b/mm/swapfile.c index 2f8e429..8ec511e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, int ret = 1; if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, - GFP_KERNEL, &memcg)) { + GFP_KERNEL, &memcg, true)) { ret = -ENOMEM; goto out_nolock; } -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
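The commit message above notes that the lockup is recoverable because the administrator can raise the group's hard limit or kill a task inside it. With cgroup v1 memcg that intervention looks roughly like the following sketch (the mount point and the group name `user1234` are assumptions, not taken from the thread):

```shell
# Hypothetical recovery session (cgroup v1 memory controller).
CG=/sys/fs/cgroup/memory/user1234

# Confirm the group is pinned at its hard limit.
cat "$CG/memory.usage_in_bytes" "$CG/memory.limit_in_bytes"

# Raise the limit so the writer blocked in the charge path can succeed,
# release i_mutex, and unstick the task looping in mem_cgroup_handle_oom.
echo $((512 * 1024 * 1024)) > "$CG/memory.limit_in_bytes"

# Alternatively, free charged memory by killing a task inside the group:
# head -1 "$CG/tasks" | xargs kill -9
```

This is a manual workaround only; the patch's point is that page cache charges should fail with ENOMEM instead of ever entering that OOM wait.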
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-28 16:04 ` Michal Hocko (?) @ 2012-11-28 16:37 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-28 16:37 UTC (permalink / raw) To: Michal Hocko Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..5abe441 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > gfp_t gfp_mask); > /* for swap handling */ > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > + bool oom); Ok, now I feel almost bad for asking, but why the public interface, too? You only ever pass "true" in there and this is unlikely to change anytime soon, no? > @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,20 +3776,21 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > } Only this one is needed... 
> @@ -3851,7 +3852,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3864,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } ...for this site. > diff --git a/mm/memory.c b/mm/memory.c > index 6891d3b..afad903 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > } > } > > - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { > + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { > ret = VM_FAULT_OOM; > goto out_page; > } Can not happen for shmem, the fault handler uses vma->vm_ops->fault. > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2f8e429..8ec511e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, > int ret = 1; > > if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, > - GFP_KERNEL, &memcg)) { > + GFP_KERNEL, &memcg, true)) { > ret = -ENOMEM; > goto out_nolock; > } Can not happen for shmem, uses shmem_unuse() instead. ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-28 16:37 ` Johannes Weiner @ 2012-11-28 16:46 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-28 16:46 UTC (permalink / raw) To: Johannes Weiner Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 095d2b4..5abe441 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > gfp_t gfp_mask); > > /* for swap handling */ > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > + bool oom); > > Ok, now I feel almost bad for asking, but why the public interface, > too? Would it help if I said it was to double-check that your review quality hasn't decreased after that many revisions? 
:P Incremental update and the full patch in the reply --- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5abe441..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,8 +57,7 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); /* for swap handling */ extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t mask, struct mem_cgroup **memcgp, - bool oom); + struct page *page, gfp_t mask, struct mem_cgroup **memcgp); extern void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg); extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); @@ -218,8 +217,7 @@ static inline int mem_cgroup_cache_charge(struct page *page, } static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, - struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp, - bool oom) + struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp) { return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02a6d70..3c9b1c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3789,8 +3789,7 @@ charge_cur_mm: } int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, - gfp_t gfp_mask, struct mem_cgroup **memcgp, - bool oom) + gfp_t gfp_mask, struct mem_cgroup **memcgp) { *memcgp = NULL; if (mem_cgroup_disabled()) @@ -3804,12 +3803,12 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, if (!PageSwapCache(page)) { int ret; - ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, oom); + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true); if (ret == -EINTR) ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, oom); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) diff --git a/mm/memory.c b/mm/memory.c index afad903..6891d3b 100644 
--- a/mm/memory.c +++ b/mm/memory.c @@ -2991,7 +2991,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, } } - if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr, true)) { + if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) { ret = VM_FAULT_OOM; goto out_page; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 8ec511e..2f8e429 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -828,7 +828,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, int ret = 1; if (mem_cgroup_try_charge_swapin(vma->vm_mm, page, - GFP_KERNEL, &memcg, true)) { + GFP_KERNEL, &memcg)) { ret = -ENOMEM; goto out_nolock; } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-28 16:46 ` Michal Hocko (?) @ 2012-11-28 16:48 ` Michal Hocko -1 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-28 16:48 UTC (permalink / raw) To: Johannes Weiner Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed 28-11-12 17:46:40, Michal Hocko wrote: > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 095d2b4..5abe441 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > gfp_t gfp_mask); > > > /* for swap handling */ > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > + bool oom); > > > > Ok, now I feel almost bad for asking, but why the public interface, > > too? > > Would it help if I said it was to double-check that your review quality hasn't decreased after that many revisions? :P > > Incremental update and the full patch in the reply --- From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 28 Nov 2012 17:46:32 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents another task from terminating, because that task is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. 
there is no swap or the swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by the memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncating it). Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though, because the administrator can still intervene and increase the limit on the group, which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM, but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. 
Changes since v1 - do not abuse gfp_flags and rather use the oom parameter directly as per Johannes - also handle shmem write faults resp. fallocate properly as per Johannes Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 5 +++-- mm/filemap.c | 9 +++++++-- mm/memcontrol.c | 20 ++++++++++---------- mm/shmem.c | 17 ++++++++++++++--- 4 files changed, 34 insertions(+), 17 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 095d2b4..8f48d5e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); + gfp_t gfp_mask, bool oom); struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, } static inline int mem_cgroup_cache_charge(struct page *page, - struct mm_struct *mm, gfp_t gfp_mask) + struct mm_struct *mm, gfp_t gfp_mask, + bool oom) { return 0; } diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ef8fbd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); - error = mem_cgroup_cache_charge(page, current->mm, - gfp_mask & GFP_RECLAIM_MASK); + /* + * Cannot trigger OOM even if gfp_mask would allow that normally + * because we might be called from a locked context and that + * could lead to deadlocks if the killed process is waiting for + * the same lock. 
+ */ + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 02ee2f7..3c9b1c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3709,11 +3709,10 @@ out: * < 0 if the cgroup is over its limit */ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask, enum charge_type ctype) + gfp_t gfp_mask, enum charge_type ctype, bool oom) { struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; - bool oom = true; int ret; if (PageTransHuge(page)) { @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, VM_BUG_ON(page->mapping && !PageAnon(page)); VM_BUG_ON(!mm); return mem_cgroup_charge_common(page, mm, gfp_mask, - MEM_CGROUP_CHARGE_TYPE_ANON); + MEM_CGROUP_CHARGE_TYPE_ANON, true); } /* @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, - struct mem_cgroup **memcgp) + struct mem_cgroup **memcgp, + bool oom) { struct mem_cgroup *memcg; struct page_cgroup *pc; @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (!memcg) goto charge_cur_mm; *memcgp = memcg; - ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); css_put(&memcg->css); if (ret == -EINTR) ret = 0; return ret; charge_cur_mm: - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); if (ret == -EINTR) ret = 0; return ret; @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, ret = 0; return ret; } - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); } void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) @@ -3851,7 +3851,7 @@ void 
mem_cgroup_commit_charge_swapin(struct page *page, } int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) + gfp_t gfp_mask, bool oom) { struct mem_cgroup *memcg = NULL; enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return 0; if (!PageSwapCache(page)) - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); else { /* page is swapcache/shmem */ ret = __mem_cgroup_try_charge_swapin(mm, page, - gfp_mask, &memcg); + gfp_mask, &memcg, oom); if (!ret) __mem_cgroup_commit_charge_swapin(page, memcg, type); } diff --git a/mm/shmem.c b/mm/shmem.c index 55054a7..3b27db4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) * the shmem_swaplist_mutex which might hold up shmem_writepage(). * Charged back to the user (not to caller) when swap account is used. */ - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); if (error) goto out; /* No radix_tree_preload: swap entry keeps a place for page in tree */ @@ -1152,8 +1152,17 @@ repeat: goto failed; } + /* + * Cannot trigger OOM even if gfp_mask would allow that + * normally because we might be called from a locked + * context (i_mutex held) if this is a write lock or + * fallocate and that could lead to deadlocks if the + * killed process is waiting for the same lock. 
+ */ error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (!error) { error = shmem_add_to_page_cache(page, mapping, index, gfp, swp_to_radix_entry(swap)); @@ -1209,7 +1218,9 @@ repeat: SetPageSwapBacked(page); __set_page_locked(page); error = mem_cgroup_cache_charge(page, current->mm, - gfp & GFP_RECLAIM_MASK); + gfp & GFP_RECLAIM_MASK, + sgp != SGP_WRITE && + sgp != SGP_FALLOC); if (error) goto decused; error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked @ 2012-11-28 16:48 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-28 16:48 UTC (permalink / raw) To: Johannes Weiner Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed 28-11-12 17:46:40, Michal Hocko wrote: > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 095d2b4..5abe441 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > gfp_t gfp_mask); > > > /* for swap handling */ > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > + bool oom); > > > > Ok, now I feel almost bad for asking, but why the public interface, > > too? > > Would it work out if I tell it was to double check that your review > quality is not decreased after that many revisions? :P > > Incremental update and the full patch in the reply --- From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 28 Nov 2012 17:46:32 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. 
there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). Process A [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex [<ffffffff81121c90>] do_last+0x250/0xa30 [<ffffffff81122547>] path_openat+0xd7/0x440 [<ffffffff811229c9>] do_filp_open+0x49/0xa0 [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 [<ffffffff8110f950>] sys_open+0x20/0x30 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff Process B [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [<ffffffff8111156a>] do_sync_write+0xea/0x130 [<ffffffff81112183>] vfs_write+0xf3/0x1f0 [<ffffffff81112381>] sys_write+0x51/0x90 [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d [<ffffffffffffffff>] 0xffffffffffffffff This is not a hard deadlock though because administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. This patch heals the problem by forbidding OOM from page cache charges (namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom argument which is pushed down the call chain. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable than OOM killer IMO. 
Changes since v1
- do not abuse gfp_flags and rather use the oom parameter directly as per Johannes
- handle also shmem write faults resp. fallocate properly as per Johannes

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |  5 +++--
 mm/filemap.c               |  9 +++++++--
 mm/memcontrol.c            | 20 ++++++++++----------
 mm/shmem.c                 | 17 ++++++++++++++---
 4 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 095d2b4..8f48d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);

 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+					gfp_t gfp_mask, bool oom);

 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page,
 }

 static inline int mem_cgroup_cache_charge(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
+					struct mm_struct *mm, gfp_t gfp_mask,
+					bool oom)
 {
 	return 0;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ef8fbd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));

-	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+	 * because we might be called from a locked context and that
+	 * could lead to deadlocks if the killed process is waiting for
+	 * the same lock.
+	 */
+	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false);
 	if (error)
 		goto out;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..3c9b1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3709,11 +3709,10 @@ out:
  * < 0 if the cgroup is over its limit
  */
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype)
+				gfp_t gfp_mask, enum charge_type ctype, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
-	bool oom = true;
 	int ret;

 	if (PageTransHuge(page)) {
@@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page,
 	VM_BUG_ON(page->mapping && !PageAnon(page));
 	VM_BUG_ON(!mm);
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
-					MEM_CGROUP_CHARGE_TYPE_ANON);
+					MEM_CGROUP_CHARGE_TYPE_ANON, true);
 }

 /*
@@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page,
 static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 					struct page *page,
 					gfp_t mask,
-					struct mem_cgroup **memcgp)
+					struct mem_cgroup **memcgp,
+					bool oom)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
@@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*memcgp = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom);
 	css_put(&memcg->css);
 	if (ret == -EINTR)
 		ret = 0;
 	return ret;
 charge_cur_mm:
-	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
+	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom);
 	if (ret == -EINTR)
 		ret = 0;
 	return ret;
@@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 		ret = 0;
 		return ret;
 	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
+	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true);
 }

 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
@@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 }

 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+				gfp_t gfp_mask, bool oom)
 {
 	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
@@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		return 0;

 	if (!PageSwapCache(page))
-		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom);
 	else { /* page is swapcache/shmem */
 		ret = __mem_cgroup_try_charge_swapin(mm, page,
-						gfp_mask, &memcg);
+						gfp_mask, &memcg, oom);
 		if (!ret)
 			__mem_cgroup_commit_charge_swapin(page, memcg, type);
 	}
diff --git a/mm/shmem.c b/mm/shmem.c
index 55054a7..3b27db4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -1152,8 +1152,17 @@ repeat:
 			goto failed;
 		}

+		/*
+		 * Cannot trigger OOM even if gfp_mask would allow that
+		 * normally because we might be called from a locked
+		 * context (i_mutex held) if this is a write lock or
+		 * fallocate and that could lead to deadlocks if the
+		 * killed process is waiting for the same lock.
+		 */
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp != SGP_WRITE &&
+						sgp != SGP_FALLOC);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1209,7 +1218,9 @@ repeat:
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+						gfp & GFP_RECLAIM_MASK,
+						sgp != SGP_WRITE &&
+						sgp != SGP_FALLOC);
 		if (error)
 			goto decused;
 		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
--
1.7.10.4

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked @ 2012-11-28 16:48 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-28 16:48 UTC (permalink / raw) To: Johannes Weiner Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed 28-11-12 17:46:40, Michal Hocko wrote: > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 095d2b4..5abe441 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > gfp_t gfp_mask); > > > /* for swap handling */ > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > + bool oom); > > > > Ok, now I feel almost bad for asking, but why the public interface, > > too? > > Would it work out if I tell it was to double check that your review > quality is not decreased after that many revisions? :P > > Incremental update and the full patch in the reply --- ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-28 16:48 ` Michal Hocko (?) @ 2012-11-28 18:44 ` Johannes Weiner -1 siblings, 0 replies; 444+ messages in thread From: Johannes Weiner @ 2012-11-28 18:44 UTC (permalink / raw) To: Michal Hocko Cc: KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed, Nov 28, 2012 at 05:48:24PM +0100, Michal Hocko wrote: > On Wed 28-11-12 17:46:40, Michal Hocko wrote: > > On Wed 28-11-12 11:37:36, Johannes Weiner wrote: > > > On Wed, Nov 28, 2012 at 05:04:47PM +0100, Michal Hocko wrote: > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > > index 095d2b4..5abe441 100644 > > > > --- a/include/linux/memcontrol.h > > > > +++ b/include/linux/memcontrol.h > > > > @@ -57,13 +57,14 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, > > > > gfp_t gfp_mask); > > > > /* for swap handling */ > > > > extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm, > > > > - struct page *page, gfp_t mask, struct mem_cgroup **memcgp); > > > > + struct page *page, gfp_t mask, struct mem_cgroup **memcgp, > > > > + bool oom); > > > > > > Ok, now I feel almost bad for asking, but why the public interface, > > > too? > > > > Would it work out if I tell it was to double check that your review > > quality is not decreased after that many revisions? :P Deal. > >From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. 
there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > Process B > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > This is not a hard deadlock though because administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom > argument which is pushed down the call chain. > > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable than OOM killer IMO. 
> > Changes since v1 > - do not abuse gfp_flags and rather use oom parameter directly as per > Johannes > - handle also shmem write fauls resp. fallocate properly as per Johannes > > Reported-by: azurIt <azurit@pobox.sk> > Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Thanks, Michal! ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-28 16:48 ` Michal Hocko (?) @ 2012-11-28 20:20 ` Hugh Dickins -1 siblings, 0 replies; 444+ messages in thread From: Hugh Dickins @ 2012-11-28 20:20 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed, 28 Nov 2012, Michal Hocko wrote: > From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > Process B > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > This is not a hard deadlock though because administrator can still > intervene and increase the limit on the group which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_ro_page_cache_locked). mem_cgroup_cache_charge grows oom > argument which is pushed down the call chain. > > As a possibly visible result add_to_page_cache_lru might fail more often > with ENOMEM but this is to be expected if the limit is set and it is > preferable than OOM killer IMO. > > Changes since v1 > - do not abuse gfp_flags and rather use oom parameter directly as per > Johannes > - handle also shmem write fauls resp. 
fallocate properly as per Johannes > > Reported-by: azurIt <azurit@pobox.sk> > Signed-off-by: Michal Hocko <mhocko@suse.cz> Sorry, Michal, you've laboured hard on this: but I dislike it so much that I'm here overcoming my dread of entering an OOM-killer discussion, and the resultant deluge of unwelcome CCs for eternity afterwards. I had been relying on Johannes to repeat his "This issue has been around for a while so frankly I don't think it's urgent enough to rush things", but it looks like I have to be the one to repeat it. Your analysis of azurIt's traces may well be correct, and this patch may indeed ameliorate the situation, and it's fine as something for azurIt to try and report on and keep in his tree; but I hope that it does not go upstream and to stable. Why do I dislike it so much? I suppose because it's both too general and too limited at the same time. Too general in that it changes the behaviour on OOM for a large set of memcg charges, all those that go through add_to_page_cache_locked(), when only a subset of those have the i_mutex issue. If you're going to be that general, why not go further? Leave the mem_cgroup_cache_charge() interface as is, make it not-OOM internally, no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other filesystem gets the benefit of those distinctions: isn't it better to keep it simple? (And I can see a partial truncation case where shmem uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour is a non-issue, since swapoff invites itself to be killed anyway.) Too limited in that i_mutex is just the held resource which azurIt's traces have led you to, but it's a general problem that the OOM-killed task might be waiting for a resource that the OOM-killing task holds. I suspect that if we try hard enough (I admit I have not), we can find an example of such a potential deadlock for almost every memcg charge site. mmap_sem? 
not as easy to invent a case with that as I thought, since it needs a down_write, and the typical page allocations happen with down_read, and I can't think of a process which does down_write on another's mm. But i_mutex is always good, once you remember the case of write to file from userspace page which got paged out, so the fault path has to read it back in, while i_mutex is still held at the outer level. An unusual case? Well, normally yes, but we're considering out-of-memory conditions, which may converge upon cases like this. Wouldn't it be nice if I could be constructive? But I'm sceptical even of Johannes's faith in what the global OOM killer would do: how does __alloc_pages_slowpath() get out of its "goto restart" loop, excepting the trivial case when the killer is the killed? I wonder why this issue has hit azurIt and no other reporter? No swap plays a part in it, but that's not so unusual. Yours glOOMily, Hugh > --- > include/linux/memcontrol.h | 5 +++-- > mm/filemap.c | 9 +++++++-- > mm/memcontrol.c | 20 ++++++++++---------- > mm/shmem.c | 17 ++++++++++++++--- > 4 files changed, 34 insertions(+), 17 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..8f48d5e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, > extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); > > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask); > + gfp_t gfp_mask, bool oom); > > struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); > @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, > } > > static inline int mem_cgroup_cache_charge(struct page *page, > - struct mm_struct *mm, gfp_t gfp_mask) > + struct mm_struct *mm, gfp_t gfp_mask, > + 
bool oom) > { > return 0; > } > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..ef8fbd5 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > - error = mem_cgroup_cache_charge(page, current->mm, > - gfp_mask & GFP_RECLAIM_MASK); > + /* > + * Cannot trigger OOM even if gfp_mask would allow that normally > + * because we might be called from a locked context and that > + * could lead to deadlocks if the killed process is waiting for > + * the same lock. > + */ > + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); > if (error) > goto out; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 02ee2f7..3c9b1c5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3709,11 +3709,10 @@ out: > * < 0 if the cgroup is over its limit > */ > static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask, enum charge_type ctype) > + gfp_t gfp_mask, enum charge_type ctype, bool oom) > { > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > - bool oom = true; > int ret; > > if (PageTransHuge(page)) { > @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, > VM_BUG_ON(page->mapping && !PageAnon(page)); > VM_BUG_ON(!mm); > return mem_cgroup_charge_common(page, mm, gfp_mask, > - MEM_CGROUP_CHARGE_TYPE_ANON); > + MEM_CGROUP_CHARGE_TYPE_ANON, true); > } > > /* > @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = 
__mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, > ret = 0; > return ret; > } > - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); > + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); > } > > void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) > @@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 55054a7..3b27db4 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) > * the shmem_swaplist_mutex which might hold up shmem_writepage(). > * Charged back to the user (not to caller) when swap account is used. 
> */ > - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); > + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); > if (error) > goto out; > /* No radix_tree_preload: swap entry keeps a place for page in tree */ > @@ -1152,8 +1152,17 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (!error) { > error = shmem_add_to_page_cache(page, mapping, index, > gfp, swp_to_radix_entry(swap)); > @@ -1209,7 +1218,9 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (error) > goto decused; > error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); > -- > 1.7.10.4 > > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > ^ permalink raw reply [flat|nested] 444+ messages in thread
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked @ 2012-11-28 20:20 ` Hugh Dickins 0 siblings, 0 replies; 444+ messages in thread From: Hugh Dickins @ 2012-11-28 20:20 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, KAMEZAWA Hiroyuki, azurIt, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups mailinglist On Wed, 28 Nov 2012, Michal Hocko wrote: > From e21bb704947e9a477ec1df9121575c606dbfcb52 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > Date: Wed, 28 Nov 2012 17:46:32 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > Process A > [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex > [<ffffffff81121c90>] do_last+0x250/0xa30 > [<ffffffff81122547>] path_openat+0xd7/0x440 > [<ffffffff811229c9>] do_filp_open+0x49/0xa0 > [<ffffffff8110f7d6>] do_sys_open+0x106/0x240 > [<ffffffff8110f950>] sys_open+0x20/0x30 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > Process B > [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0 > [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0 > [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0 > [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140 > [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50 > [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0 > [<ffffffff81193a18>] ext3_write_begin+0x88/0x270 > [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290 > [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480 > [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex > [<ffffffff8111156a>] do_sync_write+0xea/0x130 > [<ffffffff81112183>] vfs_write+0xf3/0x1f0 > [<ffffffff81112381>] sys_write+0x51/0x90 > [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d > [<ffffffffffffffff>] 0xffffffffffffffff > > This is not a hard deadlock though because the administrator can still > intervene and increase the limit on the group, which helps the writer to > finish the allocation and release the lock. > > This patch heals the problem by forbidding OOM from page cache charges > (namely add_to_page_cache_locked). mem_cgroup_cache_charge grows an oom > argument which is pushed down the call chain. > > As a possibly visible result, add_to_page_cache_lru might fail more often > with ENOMEM, but this is to be expected if the limit is set and it is > preferable to the OOM killer IMO. > > Changes since v1 > - do not abuse gfp_flags and rather use oom parameter directly as per > Johannes > - also handle shmem write faults and 
fallocate properly as per Johannes > > Reported-by: azurIt <azurit-Rm0zKEqwvD4@public.gmane.org> > Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Sorry, Michal, you've laboured hard on this: but I dislike it so much that I'm here overcoming my dread of entering an OOM-killer discussion, and the resultant deluge of unwelcome CCs for eternity afterwards. I had been relying on Johannes to repeat his "This issue has been around for a while so frankly I don't think it's urgent enough to rush things", but it looks like I have to be the one to repeat it. Your analysis of azurIt's traces may well be correct, and this patch may indeed ameliorate the situation, and it's fine as something for azurIt to try and report on and keep in his tree; but I hope that it does not go upstream and to stable. Why do I dislike it so much? I suppose because it's both too general and too limited at the same time. Too general in that it changes the behaviour on OOM for a large set of memcg charges, all those that go through add_to_page_cache_locked(), when only a subset of those have the i_mutex issue. If you're going to be that general, why not go further? Leave the mem_cgroup_cache_charge() interface as is, make it not-OOM internally, no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other filesystem gets the benefit of those distinctions: isn't it better to keep it simple? (And I can see a partial truncation case where shmem uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour is a non-issue, since swapoff invites itself to be killed anyway.) Too limited in that i_mutex is just the held resource which azurIt's traces have led you to, but it's a general problem that the OOM-killed task might be waiting for a resource that the OOM-killing task holds. I suspect that if we try hard enough (I admit I have not), we can find an example of such a potential deadlock for almost every memcg charge site. mmap_sem? 
not as easy to invent a case with that as I thought, since it needs a down_write, and the typical page allocations happen with down_read, and I can't think of a process which does down_write on another's mm. But i_mutex is always good, once you remember the case of write to file from userspace page which got paged out, so the fault path has to read it back in, while i_mutex is still held at the outer level. An unusual case? Well, normally yes, but we're considering out-of-memory conditions, which may converge upon cases like this. Wouldn't it be nice if I could be constructive? But I'm sceptical even of Johannes's faith in what the global OOM killer would do: how does __alloc_pages_slowpath() get out of its "goto restart" loop, excepting the trivial case when the killer is the killed? I wonder why this issue has hit azurIt and no other reporter? No swap plays a part in it, but that's not so unusual. Yours glOOMily, Hugh > --- > include/linux/memcontrol.h | 5 +++-- > mm/filemap.c | 9 +++++++-- > mm/memcontrol.c | 20 ++++++++++---------- > mm/shmem.c | 17 ++++++++++++++--- > 4 files changed, 34 insertions(+), 17 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 095d2b4..8f48d5e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -63,7 +63,7 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page, > extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg); > > extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask); > + gfp_t gfp_mask, bool oom); > > struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); > struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); > @@ -210,7 +210,8 @@ static inline int mem_cgroup_newpage_charge(struct page *page, > } > > static inline int mem_cgroup_cache_charge(struct page *page, > - struct mm_struct *mm, gfp_t gfp_mask) > + struct mm_struct *mm, gfp_t gfp_mask, > + 
bool oom) > { > return 0; > } > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..ef8fbd5 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -447,8 +447,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > VM_BUG_ON(!PageLocked(page)); > VM_BUG_ON(PageSwapBacked(page)); > > - error = mem_cgroup_cache_charge(page, current->mm, > - gfp_mask & GFP_RECLAIM_MASK); > + /* > + * Cannot trigger OOM even if gfp_mask would allow that normally > + * because we might be called from a locked context and that > + * could lead to deadlocks if the killed process is waiting for > + * the same lock. > + */ > + error = mem_cgroup_cache_charge(page, current->mm, gfp_mask, false); > if (error) > goto out; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 02ee2f7..3c9b1c5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3709,11 +3709,10 @@ out: > * < 0 if the cgroup is over its limit > */ > static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask, enum charge_type ctype) > + gfp_t gfp_mask, enum charge_type ctype, bool oom) > { > struct mem_cgroup *memcg = NULL; > unsigned int nr_pages = 1; > - bool oom = true; > int ret; > > if (PageTransHuge(page)) { > @@ -3742,7 +3741,7 @@ int mem_cgroup_newpage_charge(struct page *page, > VM_BUG_ON(page->mapping && !PageAnon(page)); > VM_BUG_ON(!mm); > return mem_cgroup_charge_common(page, mm, gfp_mask, > - MEM_CGROUP_CHARGE_TYPE_ANON); > + MEM_CGROUP_CHARGE_TYPE_ANON, true); > } > > /* > @@ -3754,7 +3753,8 @@ int mem_cgroup_newpage_charge(struct page *page, > static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > struct page *page, > gfp_t mask, > - struct mem_cgroup **memcgp) > + struct mem_cgroup **memcgp, > + bool oom) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc; > @@ -3776,13 +3776,13 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, > if (!memcg) > goto charge_cur_mm; > *memcgp = memcg; > - ret = 
__mem_cgroup_try_charge(NULL, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, oom); > css_put(&memcg->css); > if (ret == -EINTR) > ret = 0; > return ret; > charge_cur_mm: > - ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); > + ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, oom); > if (ret == -EINTR) > ret = 0; > return ret; > @@ -3808,7 +3808,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, > ret = 0; > return ret; > } > - return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); > + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp, true); > } > > void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) > @@ -3851,7 +3851,7 @@ void mem_cgroup_commit_charge_swapin(struct page *page, > } > > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > - gfp_t gfp_mask) > + gfp_t gfp_mask, bool oom) > { > struct mem_cgroup *memcg = NULL; > enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; > @@ -3863,10 +3863,10 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, > return 0; > > if (!PageSwapCache(page)) > - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); > + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type, oom); > else { /* page is swapcache/shmem */ > ret = __mem_cgroup_try_charge_swapin(mm, page, > - gfp_mask, &memcg); > + gfp_mask, &memcg, oom); > if (!ret) > __mem_cgroup_commit_charge_swapin(page, memcg, type); > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 55054a7..3b27db4 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -760,7 +760,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page) > * the shmem_swaplist_mutex which might hold up shmem_writepage(). > * Charged back to the user (not to caller) when swap account is used. 
> */ > - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); > + error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL, true); > if (error) > goto out; > /* No radix_tree_preload: swap entry keeps a place for page in tree */ > @@ -1152,8 +1152,17 @@ repeat: > goto failed; > } > > + /* > + * Cannot trigger OOM even if gfp_mask would allow that > + * normally because we might be called from a locked > + * context (i_mutex held) if this is a write lock or > + * fallocate and that could lead to deadlocks if the > + * killed process is waiting for the same lock. > + */ > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (!error) { > error = shmem_add_to_page_cache(page, mapping, index, > gfp, swp_to_radix_entry(swap)); > @@ -1209,7 +1218,9 @@ repeat: > SetPageSwapBacked(page); > __set_page_locked(page); > error = mem_cgroup_cache_charge(page, current->mm, > - gfp & GFP_RECLAIM_MASK); > + gfp & GFP_RECLAIM_MASK, > + sgp != SGP_WRITE && > + sgp != SGP_FALLOC); > if (error) > goto decused; > error = radix_tree_preload(gfp & GFP_RECLAIM_MASK); > -- > 1.7.10.4 > > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a> > ^ permalink raw reply [flat|nested] 444+ messages in thread
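The deadlock the changelog above describes is a two-node wait-for cycle: the writer holds i_mutex and loops in mem_cgroup_handle_oom waiting for the OOM victim to exit, while the victim cannot exit because it sleeps on that same i_mutex. A small Python model of this (hypothetical task names, not kernel code) that detects such a cycle:

```python
def find_cycle(waits_for):
    """Given {task: task-it-waits-for}, return the tasks on a cycle, or an empty set."""
    for start in waits_for:
        seen = []
        cur = start
        while cur in waits_for and cur not in seen:
            seen.append(cur)
            cur = waits_for[cur]
        if cur in seen:
            return set(seen[seen.index(cur):])
    return set()

# Process B holds i_mutex and waits for the victim (Process A) to die;
# Process A waits for i_mutex in do_truncate -> neither can make progress.
deadlock = find_cycle({"B (writer)": "A (victim)", "A (victim)": "B (writer)"})
```

Raising the limit corresponds to deleting B's outgoing edge (its charge succeeds), which breaks the cycle; that is why the changelog calls this "not a hard deadlock".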
* Re: [PATCH -v2 -mm] memcg: do not trigger OOM from add_to_page_cache_locked 2012-11-28 20:20 ` Hugh Dickins @ 2012-11-29 14:05 ` Michal Hocko 0 siblings, 0 replies; 444+ messages in thread From: Michal Hocko @ 2012-11-29 14:05 UTC (permalink / raw) To: Hugh Dickins Cc: Johannes Weiner, KAMEZAWA Hiroyuki, azurIt, linux-kernel, linux-mm, cgroups mailinglist On Wed 28-11-12 12:20:44, Hugh Dickins wrote: [...] > Sorry, Michal, you've laboured hard on this: but I dislike it so much > that I'm here overcoming my dread of entering an OOM-killer discussion, > and the resultant deluge of unwelcome CCs for eternity afterwards. > > I had been relying on Johannes to repeat his "This issue has been > around for a while so frankly I don't think it's urgent enough to > rush things", but it looks like I have to be the one to repeat it. Well, the idea was to use this only as a temporary fix and come up with a better solution without any hurry. > Your analysis of azurIt's traces may well be correct, and this patch > may indeed ameliorate the situation, and it's fine as something for > azurIt to try and report on and keep in his tree; but I hope that > it does not go upstream and to stable. > > Why do I dislike it so much? I suppose because it's both too general > and too limited at the same time. > > Too general in that it changes the behaviour on OOM for a large set > of memcg charges, all those that go through add_to_page_cache_locked(), > when only a subset of those have the i_mutex issue. This is a fair point but the real fix which we were discussing with Johannes would be even more risky for stable. > If you're going to be that general, why not go further? Leave the > mem_cgroup_cache_charge() interface as is, make it not-OOM internally, > no need for SGP_WRITE,SGP_FALLOC distinctions in mm/shmem.c. No other > filesystem gets the benefit of those distinctions: isn't it better to > keep it simple? 
(And I can see a partial truncation case where shmem > uses SGP_READ under i_mutex; and the change to shmem_unuse behaviour > is a non-issue, since swapoff invites itself to be killed anyway.) > > Too limited in that i_mutex is just the held resource which azurIt's > traces have led you to, but it's a general problem that the OOM-killed > task might be waiting for a resource that the OOM-killing task holds. > > I suspect that if we try hard enough (I admit I have not), we can find > an example of such a potential deadlock for almost every memcg charge > site. mmap_sem? not as easy to invent a case with that as I thought, > since it needs a down_write, and the typical page allocations happen > with down_read, and I can't think of a process which does down_write > on another's mm. > > But i_mutex is always good, once you remember the case of write to > file from userspace page which got paged out, so the fault path has > to read it back in, while i_mutex is still held at the outer level. > An unusual case? Well, normally yes, but we're considering > out-of-memory conditions, which may converge upon cases like this. > > Wouldn't it be nice if I could be constructive? But I'm sceptical > even of Johannes's faith in what the global OOM killer would do: > how does __alloc_pages_slowpath() get out of its "goto restart" > loop, excepting the trivial case when the killer is the killed? I am not sure I am following you here, but Johannes's idea was to break out of the charge after a signal has been sent and the charge still fails, and either retry the fault or fail the allocation. I think this should work but I am afraid that this needs some tuning (the number of retries, for example) to prevent overly aggressive OOMs or too many failures. Do we have any other possibilities to solve this issue? Or do you think we should ignore the problem just because nobody complained for such a long time? 
Dunno, I think we should fix this with something less risky for now and come up with a real fix after it sees sufficient testing. > I wonder why this issue has hit azurIt and no other reporter? > No swap plays a part in it, but that's not so unusual. > > Yours glOOMily, > Hugh [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 444+ messages in thread
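The alternative Michal describes here — let the charger notice a pending fatal signal, retry a bounded number of times, then give up with ENOMEM — can be outlined in a few lines of Python (the retry count and helper names are invented for illustration; this is not the eventual kernel fix):

```python
ENOMEM = 12
MAX_OOM_RETRIES = 5  # the tuning knob Michal worries about

def try_charge(charge_once, fatal_signal_pending):
    """charge_once() -> True on success; fatal_signal_pending() -> True if killed."""
    for _ in range(MAX_OOM_RETRIES):
        if charge_once():
            return 0            # charge succeeded, no OOM needed
        if fatal_signal_pending():
            return -ENOMEM      # we are the victim: bail out so we can exit and free memory
    return -ENOMEM              # give up: fail the allocation or retry the fault
```

Too few retries would fail allocations that a little reclaim could have satisfied; too many would reintroduce the livelock — which is exactly the tuning concern raised in the reply above.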
end of thread, other threads:[~2013-07-25 21:50 UTC | newest] Thread overview: 444+ messages (download: mbox.gz / follow: Atom feed). Distinct subjects in the thread (first occurrences): 2012-11-21 memory-cgroup bug (azurIt) · 2012-11-26 [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache_locked (Michal Hocko) · 2012-11-26 [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked (Michal Hocko) · 2013-02-06 [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set (Michal Hocko) · 2013-06-07 [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM (Michal Hocko) · 2013-07-19 [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers (Johannes Weiner) · 2013-07-19 [patch 2/5] mm: pass userspace fault flag to generic fault handler (Johannes Weiner) · 2013-07-19 [patch 3/5] x86: finish fault error path with fatal signal (Johannes Weiner) · 2013-07-19 [patch 4/5] memcg: do not trap chargers with full callstack on OOM (Johannes Weiner) · 2013-07-19 [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind (Johannes Weiner)