* [PATCH 0/4 -v2] OOM vs. freezer interaction fixes @ 2014-10-21 7:27 Michal Hocko 2014-10-21 7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko ` (3 more replies) 0 siblings, 4 replies; 93+ messages in thread From: Michal Hocko @ 2014-10-21 7:27 UTC (permalink / raw) To: Andrew Morton, \"Rafael J. Wysocki\" Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list Hi Andrew, Rafael, this has been originally discussed here [1] and previously posted here [2]. I have updated patches according to feedback from Oleg. The first and third patch are regression fixes and they are a stable material IMO. The second and fourth patch are simple cleanups. The 1st patch is fixing a regression introduced in 3.3 since when OOM killer is not able to kill any frozen task and live lock as a result. The fix gets us back to the 3.2. As it turned out during the discussion [3] this was still not 100% sufficient and that's why we need the 3rd patch. I was thinking about the proper 1st vs. 3rd patch ordering because the 1st patch basically opens a race window considerably reduced by the later patch. This path is hard to do completely race free without a complete synchronization of OOM path (including the allocator) and freezer which is not worth the trouble. Original patch from Cong Wang has covered this by checking cgroup_freezing(current) in __refrigarator path [4]. But this approach still suffers from OOM vs. PM freezer interaction (OOM killer would still live lock waiting for a PM frozen task this time). So I think the most straight forward way is to address only OOM vs. frozen task interaction in the first patch, mark it for stable 3.3+ and leave the race to a separate follow up patch which is applicable to stable 3.2+ (before a3201227f803 made it inefficient). Switching 1st and 3rd patches would make some sense as well but then it might end up even more confusing because we would be fixing a non-existent issue in upstream first... Cong Wang (2): freezer: Do not freeze tasks killed by OOM killer freezer: remove obsolete comments in __thaw_task() Michal Hocko (2): OOM, PM: OOM killed task shouldn't escape PM suspend PM: convert do_each_thread to for_each_process_thread And diffstat says: include/linux/oom.h | 3 +++ kernel/freezer.c | 9 +++------ kernel/power/process.c | 47 ++++++++++++++++++++++++++++++++++++++--------- mm/oom_kill.c | 17 +++++++++++++++++ mm/page_alloc.c | 8 ++++++++ 5 files changed, 69 insertions(+), 15 deletions(-) --- [1] http://marc.info/?l=linux-kernel&m=140986986423092 [2] http://marc.info/?l=linux-mm&m=141277728508500&w=2 [3] http://marc.info/?l=linux-kernel&m=141074263721166 [4] http://marc.info/?l=linux-kernel&m=140986986423092 ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer 2014-10-21 7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko @ 2014-10-21 7:27 ` Michal Hocko 2014-10-21 12:04 ` Rafael J. Wysocki 2014-10-21 7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko ` (2 subsequent siblings) 3 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-10-21 7:27 UTC (permalink / raw) To: Andrew Morton, \"Rafael J. Wysocki\" Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list From: Cong Wang <xiyou.wangcong@gmail.com> Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) OOM killer relies on being able to thaw a frozen task to handle OOM situation but a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE) has reorganized the code and stopped clearing freeze flag in __thaw_task. This means that the target task only wakes up and goes into the fridge again because the freezing condition hasn't changed for it. This reintroduces the bug fixed by f660daac474c6f. Fix the issue by checking for TIF_MEMDIE thread flag in freezing_slow_path and exclude the task from freezing completely. If a task was already frozen it would get woken by __thaw_task from OOM killer and get out of freezer after rechecking freezing(). Changes since v1 - put TIF_MEMDIE check into freezing_slowpath rather than in __refrigerator as per Oleg - return __thaw_task into oom_scan_process_thread because oom_kill_process will not wake task in the fridge because it is sleeping uninterruptible [mhocko@suse.cz: rewrote the changelog] Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE) Cc: stable@vger.kernel.org # 3.3+ Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Oleg Nesterov <oleg@redhat.com> --- kernel/freezer.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/freezer.c b/kernel/freezer.c index aa6a8aadb911..8f9279b9c6d7 100644 --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -42,6 +42,9 @@ bool freezing_slow_path(struct task_struct *p) if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK)) return false; + if (test_thread_flag(TIF_MEMDIE)) + return false; + if (pm_nosig_freezing || cgroup_freezing(p)) return true; -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer 2014-10-21 7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko @ 2014-10-21 12:04 ` Rafael J. Wysocki 0 siblings, 0 replies; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-21 12:04 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 09:27:12 AM Michal Hocko wrote: > From: Cong Wang <xiyou.wangcong@gmail.com> > > Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen > before deferring) OOM killer relies on being able to thaw a frozen task > to handle OOM situation but a3201227f803 (freezer: make freezing() test > freeze conditions in effect instead of TIF_FREEZE) has reorganized the > code and stopped clearing freeze flag in __thaw_task. This means that > the target task only wakes up and goes into the fridge again because the > freezing condition hasn't changed for it. This reintroduces the bug > fixed by f660daac474c6f. > > Fix the issue by checking for TIF_MEMDIE thread flag in > freezing_slow_path and exclude the task from freezing completely. If a > task was already frozen it would get woken by __thaw_task from OOM killer > and get out of freezer after rechecking freezing(). > > Changes since v1 > - put TIF_MEMDIE check into freezing_slowpath rather than in __refrigerator > as per Oleg > - return __thaw_task into oom_scan_process_thread because > oom_kill_process will not wake task in the fridge because it is > sleeping uninterruptible > > [mhocko@suse.cz: rewrote the changelog] > Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE) > Cc: stable@vger.kernel.org # 3.3+ > Cc: David Rientjes <rientjes@google.com> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > Cc: Tejun Heo <tj@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> > Signed-off-by: Michal Hocko <mhocko@suse.cz> > Acked-by: Oleg Nesterov <oleg@redhat.com> ACK > --- > kernel/freezer.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/kernel/freezer.c b/kernel/freezer.c > index aa6a8aadb911..8f9279b9c6d7 100644 > --- a/kernel/freezer.c > +++ b/kernel/freezer.c > @@ -42,6 +42,9 @@ bool freezing_slow_path(struct task_struct *p) > if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK)) > return false; > > + if (test_thread_flag(TIF_MEMDIE)) > + return false; > + > if (pm_nosig_freezing || cgroup_freezing(p)) > return true; > > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() 2014-10-21 7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko 2014-10-21 7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko @ 2014-10-21 7:27 ` Michal Hocko 2014-10-21 12:04 ` Rafael J. Wysocki 2014-10-21 7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko 2014-10-21 7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko 3 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-10-21 7:27 UTC (permalink / raw) To: Andrew Morton, \"Rafael J. Wysocki\" Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list From: Cong Wang <xiyou.wangcong@gmail.com> __thaw_task() no longer clears frozen flag since commit a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE). Cc: David Rientjes <rientjes@google.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> --- kernel/freezer.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/kernel/freezer.c b/kernel/freezer.c index 8f9279b9c6d7..a8900a3bc27a 100644 --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -150,12 +150,6 @@ void __thaw_task(struct task_struct *p) { unsigned long flags; - /* - * Clear freezing and kick @p if FROZEN. Clearing is guaranteed to - * be visible to @p as waking up implies wmb. Waking up inside - * freezer_lock also prevents wakeups from leaking outside - * refrigerator. - */ spin_lock_irqsave(&freezer_lock, flags); if (frozen(p)) wake_up_process(p); -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() 2014-10-21 7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko @ 2014-10-21 12:04 ` Rafael J. Wysocki 0 siblings, 0 replies; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-21 12:04 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 09:27:13 AM Michal Hocko wrote: > From: Cong Wang <xiyou.wangcong@gmail.com> > > __thaw_task() no longer clears frozen flag since commit a3201227f803 > (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE). > > Cc: David Rientjes <rientjes@google.com> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > Cc: Tejun Heo <tj@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Reviewed-by: Michal Hocko <mhocko@suse.cz> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> ACK > --- > kernel/freezer.c | 6 ------ > 1 file changed, 6 deletions(-) > > diff --git a/kernel/freezer.c b/kernel/freezer.c > index 8f9279b9c6d7..a8900a3bc27a 100644 > --- a/kernel/freezer.c > +++ b/kernel/freezer.c > @@ -150,12 +150,6 @@ void __thaw_task(struct task_struct *p) > { > unsigned long flags; > > - /* > - * Clear freezing and kick @p if FROZEN. Clearing is guaranteed to > - * be visible to @p as waking up implies wmb. Waking up inside > - * freezer_lock also prevents wakeups from leaking outside > - * refrigerator. > - */ > spin_lock_irqsave(&freezer_lock, flags); > if (frozen(p)) > wake_up_process(p); > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko 2014-10-21 7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko 2014-10-21 7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko @ 2014-10-21 7:27 ` Michal Hocko 2014-10-21 12:09 ` Rafael J. Wysocki 2014-10-26 18:40 ` Pavel Machek 2014-10-21 7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko 3 siblings, 2 replies; 93+ messages in thread From: Michal Hocko @ 2014-10-21 7:27 UTC (permalink / raw) To: Andrew Morton, \"Rafael J. Wysocki\" Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 3 +++ kernel/power/process.c | 31 ++++++++++++++++++++++++++++++- mm/oom_kill.c | 17 +++++++++++++++++ mm/page_alloc.c | 8 ++++++++ 4 files changed, 58 insertions(+), 1 deletion(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 647395a1a550..e8d6e1058723 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -50,6 +50,9 @@ static inline bool oom_task_origin(const struct task_struct *p) extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); + +extern int oom_kills_count(void); +extern void note_oom_kill(void); extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, unsigned int points, unsigned long totalpages, struct mem_cgroup *memcg, nodemask_t *nodemask, diff --git a/kernel/power/process.c b/kernel/power/process.c index 4ee194eb524b..a397fa161d11 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -118,6 +118,7 @@ static int try_to_freeze_tasks(bool user_only) int freeze_processes(void) { int error; + int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -131,12 +132,40 @@ int freeze_processes(void) printk("Freezing user space processes ... "); pm_freezing = true; + oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); if (!error) { - printk("done."); __usermodehelper_set_disable_depth(UMH_DISABLED); oom_killer_disable(); + + /* + * There might have been an OOM kill while we were + * freezing tasks and the killed task might be still + * on the way out so we have to double check for race. + */ + if (oom_kills_count() != oom_kills_saved) { + struct task_struct *g, *p; + + read_lock(&tasklist_lock); + for_each_process_thread(g, p) { + if (p == current || freezer_should_skip(p) || + frozen(p)) + continue; + error = -EBUSY; + goto out_loop; + } +out_loop: + read_unlock(&tasklist_lock); + + if (error) { + __usermodehelper_set_disable_depth(UMH_ENABLED); + printk("OOM in progress."); + goto done; + } + } + printk("done."); } +done: printk("\n"); BUG_ON(in_atomic()); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index bbf405a3a18f..5340f6b91312 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -404,6 +404,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, dump_tasks(memcg, nodemask); } +/* + * Number of OOM killer invocations (including memcg OOM killer). + * Primarily used by PM freezer to check for potential races with + * OOM killed frozen task. + */ +static atomic_t oom_kills = ATOMIC_INIT(0); + +int oom_kills_count(void) +{ + return atomic_read(&oom_kills); +} + +void note_oom_kill(void) +{ + atomic_inc(&oom_kills); +} + #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cb573b10af12..efccbbadd7c9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2286,6 +2286,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* + * PM-freezer should be notified that there might be an OOM killer on its + * way to kill and wake somebody up. This is too early and we might end + * up not killing anything but false positives are acceptable. + * See freeze_processes. + */ + note_oom_kill(); + + /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko @ 2014-10-21 12:09 ` Rafael J. Wysocki 2014-10-21 13:14 ` Michal Hocko 2014-10-26 18:40 ` Pavel Machek 1 sibling, 1 reply; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-21 12:09 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 09:27:14 AM Michal Hocko wrote: > PM freezer relies on having all tasks frozen by the time devices are > getting frozen so that no task will touch them while they are getting > frozen. But OOM killer is allowed to kill an already frozen task in > order to handle OOM situtation. In order to protect from late wake ups > OOM killer is disabled after all tasks are frozen. This, however, still > keeps a window open when a killed task didn't manage to die by the time > freeze_processes finishes. > > Reduce the race window by checking all tasks after OOM killer has been > disabled. This is still not race free completely unfortunately because > oom_killer_disable cannot stop an already ongoing OOM killer so a task > might still wake up from the fridge and get killed without > freeze_processes noticing. Full synchronization of OOM and freezer is, > however, too heavy weight for this highly unlikely case. > > Introduce and check oom_kills counter which gets incremented early when > the allocator enters __alloc_pages_may_oom path and only check all the > tasks if the counter changes during the freezing attempt. The counter > is updated so early to reduce the race window since allocator checked > oom_killer_disabled which is set by PM-freezing code. A false positive > will push the PM-freezer into a slow path but that is not a big deal. > > Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) > Cc: Cong Wang <xiyou.wangcong@gmail.com> > Cc: Rafael J. Wysocki <rjw@rjwysocki.net> > Cc: Tejun Heo <tj@kernel.org> > Cc: David Rientjes <rientjes@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: stable@vger.kernel.org # 3.2+ > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > include/linux/oom.h | 3 +++ > kernel/power/process.c | 31 ++++++++++++++++++++++++++++++- > mm/oom_kill.c | 17 +++++++++++++++++ > mm/page_alloc.c | 8 ++++++++ > 4 files changed, 58 insertions(+), 1 deletion(-) > > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 647395a1a550..e8d6e1058723 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -50,6 +50,9 @@ static inline bool oom_task_origin(const struct task_struct *p) > extern unsigned long oom_badness(struct task_struct *p, > struct mem_cgroup *memcg, const nodemask_t *nodemask, > unsigned long totalpages); > + > +extern int oom_kills_count(void); > +extern void note_oom_kill(void); > extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > unsigned int points, unsigned long totalpages, > struct mem_cgroup *memcg, nodemask_t *nodemask, > diff --git a/kernel/power/process.c b/kernel/power/process.c > index 4ee194eb524b..a397fa161d11 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -118,6 +118,7 @@ static int try_to_freeze_tasks(bool user_only) > int freeze_processes(void) > { > int error; > + int oom_kills_saved; > > error = __usermodehelper_disable(UMH_FREEZING); > if (error) > @@ -131,12 +132,40 @@ int freeze_processes(void) > > printk("Freezing user space processes ... "); > pm_freezing = true; > + oom_kills_saved = oom_kills_count(); > error = try_to_freeze_tasks(true); > if (!error) { > - printk("done."); > __usermodehelper_set_disable_depth(UMH_DISABLED); > oom_killer_disable(); > + > + /* > + * There might have been an OOM kill while we were > + * freezing tasks and the killed task might be still > + * on the way out so we have to double check for race. > + */ > + if (oom_kills_count() != oom_kills_saved) { > + struct task_struct *g, *p; > + > + read_lock(&tasklist_lock); > + for_each_process_thread(g, p) { > + if (p == current || freezer_should_skip(p) || > + frozen(p)) > + continue; > + error = -EBUSY; > + goto out_loop; > + } > +out_loop: Well, it looks like this will work here too: for_each_process_thread(g, p) if (p != current && !frozen(p) && !freezer_should_skip(p)) { error = -EBUSY; break; } or I am helplessly misreading the code. > + read_unlock(&tasklist_lock); > + > + if (error) { > + __usermodehelper_set_disable_depth(UMH_ENABLED); > + printk("OOM in progress."); > + goto done; > + } > + } > + printk("done."); > } > +done: > printk("\n"); > BUG_ON(in_atomic()); > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index bbf405a3a18f..5340f6b91312 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -404,6 +404,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > dump_tasks(memcg, nodemask); > } > > +/* > + * Number of OOM killer invocations (including memcg OOM killer). > + * Primarily used by PM freezer to check for potential races with > + * OOM killed frozen task. > + */ > +static atomic_t oom_kills = ATOMIC_INIT(0); > + > +int oom_kills_count(void) > +{ > + return atomic_read(&oom_kills); > +} > + > +void note_oom_kill(void) > +{ > + atomic_inc(&oom_kills); > +} > + > #define K(x) ((x) << (PAGE_SHIFT-10)) > /* > * Must be called while holding a reference to p, which will be released upon > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index cb573b10af12..efccbbadd7c9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2286,6 +2286,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > } > > /* > + * PM-freezer should be notified that there might be an OOM killer on its > + * way to kill and wake somebody up. This is too early and we might end > + * up not killing anything but false positives are acceptable. > + * See freeze_processes. > + */ > + note_oom_kill(); > + > + /* > * Go through the zonelist yet one more time, keep very high watermark > * here, this is only to catch a parallel oom killing, we must fail if > * we're still under heavy pressure. > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 12:09 ` Rafael J. Wysocki @ 2014-10-21 13:14 ` Michal Hocko 2014-10-21 13:42 ` Rafael J. Wysocki 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-10-21 13:14 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote: [...] > > @@ -131,12 +132,40 @@ int freeze_processes(void) > > > > printk("Freezing user space processes ... "); > > pm_freezing = true; > > + oom_kills_saved = oom_kills_count(); > > error = try_to_freeze_tasks(true); > > if (!error) { > > - printk("done."); > > __usermodehelper_set_disable_depth(UMH_DISABLED); > > oom_killer_disable(); > > + > > + /* > > + * There might have been an OOM kill while we were > > + * freezing tasks and the killed task might be still > > + * on the way out so we have to double check for race. > > + */ > > + if (oom_kills_count() != oom_kills_saved) { > > + struct task_struct *g, *p; > > + > > + read_lock(&tasklist_lock); > > + for_each_process_thread(g, p) { > > + if (p == current || freezer_should_skip(p) || > > + frozen(p)) > > + continue; > > + error = -EBUSY; > > + goto out_loop; > > + } > > +out_loop: > > Well, it looks like this will work here too: > > for_each_process_thread(g, p) > if (p != current && !frozen(p) && > !freezer_should_skip(p)) { > error = -EBUSY; > break; > } > > or I am helplessly misreading the code. break will not work because for_each_process_thread is a double loop. Except for that the negated condition is OK as well. I can change that if you prefer. > > + read_unlock(&tasklist_lock); > > + > > + if (error) { > > + __usermodehelper_set_disable_depth(UMH_ENABLED); > > + printk("OOM in progress."); > > + goto done; > > + } > > + } > > + printk("done."); > > } > > +done: > > printk("\n"); > > BUG_ON(in_atomic()); > > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 13:14 ` Michal Hocko @ 2014-10-21 13:42 ` Rafael J. Wysocki 2014-10-21 14:11 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-21 13:42 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 03:14:45 PM Michal Hocko wrote: > On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote: > [...] > > > @@ -131,12 +132,40 @@ int freeze_processes(void) > > > > > > printk("Freezing user space processes ... "); > > > pm_freezing = true; > > > + oom_kills_saved = oom_kills_count(); > > > error = try_to_freeze_tasks(true); > > > if (!error) { > > > - printk("done."); > > > __usermodehelper_set_disable_depth(UMH_DISABLED); > > > oom_killer_disable(); > > > + > > > + /* > > > + * There might have been an OOM kill while we were > > > + * freezing tasks and the killed task might be still > > > + * on the way out so we have to double check for race. > > > + */ > > > + if (oom_kills_count() != oom_kills_saved) { > > > + struct task_struct *g, *p; > > > + > > > + read_lock(&tasklist_lock); > > > + for_each_process_thread(g, p) { > > > + if (p == current || freezer_should_skip(p) || > > > + frozen(p)) > > > + continue; > > > + error = -EBUSY; > > > + goto out_loop; > > > + } > > > +out_loop: > > > > Well, it looks like this will work here too: > > > > for_each_process_thread(g, p) > > if (p != current && !frozen(p) && > > !freezer_should_skip(p)) { > > error = -EBUSY; > > break; > > } > > > > or I am helplessly misreading the code. > > break will not work because for_each_process_thread is a double loop. I see. In that case I'd do: for_each_process_thread(g, p) if (p != current && !frozen(p) && !freezer_should_skip(p)) { read_unlock(&tasklist_lock); __usermodehelper_set_disable_depth(UMH_ENABLED); printk("OOM in progress."); error = -EBUSY; goto done; } to avoid adding the new label that looks odd. -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 13:42 ` Rafael J. Wysocki @ 2014-10-21 14:11 ` Michal Hocko 2014-10-21 14:41 ` Rafael J. Wysocki 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-10-21 14:11 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tue 21-10-14 15:42:23, Rafael J. Wysocki wrote: > On Tuesday, October 21, 2014 03:14:45 PM Michal Hocko wrote: > > On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote: > > [...] > > > > @@ -131,12 +132,40 @@ int freeze_processes(void) > > > > > > > > printk("Freezing user space processes ... "); > > > > pm_freezing = true; > > > > + oom_kills_saved = oom_kills_count(); > > > > error = try_to_freeze_tasks(true); > > > > if (!error) { > > > > - printk("done."); > > > > __usermodehelper_set_disable_depth(UMH_DISABLED); > > > > oom_killer_disable(); > > > > + > > > > + /* > > > > + * There might have been an OOM kill while we were > > > > + * freezing tasks and the killed task might be still > > > > + * on the way out so we have to double check for race. > > > > + */ > > > > + if (oom_kills_count() != oom_kills_saved) { > > > > + struct task_struct *g, *p; > > > > + > > > > + read_lock(&tasklist_lock); > > > > + for_each_process_thread(g, p) { > > > > + if (p == current || freezer_should_skip(p) || > > > > + frozen(p)) > > > > + continue; > > > > + error = -EBUSY; > > > > + goto out_loop; > > > > + } > > > > +out_loop: > > > > > > Well, it looks like this will work here too: > > > > > > for_each_process_thread(g, p) > > > if (p != current && !frozen(p) && > > > !freezer_should_skip(p)) { > > > error = -EBUSY; > > > break; > > > } > > > > > > or I am helplessly misreading the code. > > > > break will not work because for_each_process_thread is a double loop. > > I see. In that case I'd do: > > for_each_process_thread(g, p) > if (p != current && !frozen(p) && > !freezer_should_skip(p)) { > > read_unlock(&tasklist_lock); > > __usermodehelper_set_disable_depth(UMH_ENABLED); > printk("OOM in progress."); > error = -EBUSY; > goto done; > } > > to avoid adding the new label that looks odd. OK, incremental diff on top. I will post the complete patch if you are happier with this change --- diff --git a/kernel/power/process.c b/kernel/power/process.c index a397fa161d11..7a37cf3eb1a2 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,6 +108,28 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } +/* + * Returns true if all freezable tasks (except for current) are frozen already + */ +static bool check_frozen_processes(void) +{ + struct task_struct *g, *p; + bool ret = true; + + read_lock(&tasklist_lock); + for_each_process_thread(g, p) { + if (p != current && !freezer_should_skip(p) && + !frozen(p)) { + ret = false; + goto done; + } + } +done: + read_unlock(&tasklist_lock); + + return ret; +} + /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -143,25 +165,12 @@ int freeze_processes(void) * freezing tasks and the killed task might be still * on the way out so we have to double check for race. */ - if (oom_kills_count() != oom_kills_saved) { - struct task_struct *g, *p; - - read_lock(&tasklist_lock); - for_each_process_thread(g, p) { - if (p == current || freezer_should_skip(p) || - frozen(p)) - continue; - error = -EBUSY; - goto out_loop; - } -out_loop: - read_unlock(&tasklist_lock); - - if (error) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); - goto done; - } + if (oom_kills_count() != oom_kills_saved && + !check_frozen_processes()) { + __usermodehelper_set_disable_depth(UMH_ENABLED); + printk("OOM in progress."); + error = -EBUSY; + goto done; } printk("done."); } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 14:11 ` Michal Hocko @ 2014-10-21 14:41 ` Rafael J. Wysocki 2014-10-21 14:29 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-21 14:41 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote: > On Tue 21-10-14 15:42:23, Rafael J. Wysocki wrote: > > On Tuesday, October 21, 2014 03:14:45 PM Michal Hocko wrote: > > > On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote: > > > [...] > > > > > @@ -131,12 +132,40 @@ int freeze_processes(void) > > > > > > > > > > printk("Freezing user space processes ... "); > > > > > pm_freezing = true; > > > > > + oom_kills_saved = oom_kills_count(); > > > > > error = try_to_freeze_tasks(true); > > > > > if (!error) { > > > > > - printk("done."); > > > > > __usermodehelper_set_disable_depth(UMH_DISABLED); > > > > > oom_killer_disable(); > > > > > + > > > > > + /* > > > > > + * There might have been an OOM kill while we were > > > > > + * freezing tasks and the killed task might be still > > > > > + * on the way out so we have to double check for race. > > > > > + */ > > > > > + if (oom_kills_count() != oom_kills_saved) { > > > > > + struct task_struct *g, *p; > > > > > + > > > > > + read_lock(&tasklist_lock); > > > > > + for_each_process_thread(g, p) { > > > > > + if (p == current || freezer_should_skip(p) || > > > > > + frozen(p)) > > > > > + continue; > > > > > + error = -EBUSY; > > > > > + goto out_loop; > > > > > + } > > > > > +out_loop: > > > > > > > > Well, it looks like this will work here too: > > > > > > > > for_each_process_thread(g, p) > > > > if (p != current && !frozen(p) && > > > > !freezer_should_skip(p)) { > > > > error = -EBUSY; > > > > break; > > > > } > > > > > > > > or I am helplessly misreading the code. > > > > > > break will not work because for_each_process_thread is a double loop. > > > > I see. In that case I'd do: > > > > for_each_process_thread(g, p) > > if (p != current && !frozen(p) && > > !freezer_should_skip(p)) { > > > > read_unlock(&tasklist_lock); > > > > __usermodehelper_set_disable_depth(UMH_ENABLED); > > printk("OOM in progress."); > > error = -EBUSY; > > goto done; > > } > > > > to avoid adding the new label that looks odd. > > OK, incremental diff on top. I will post the complete patch if you are > happier with this change Yes, I am. > --- > diff --git a/kernel/power/process.c b/kernel/power/process.c > index a397fa161d11..7a37cf3eb1a2 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -108,6 +108,28 @@ static int try_to_freeze_tasks(bool user_only) > return todo ? -EBUSY : 0; > } > > +/* > + * Returns true if all freezable tasks (except for current) are frozen already > + */ > +static bool check_frozen_processes(void) > +{ > + struct task_struct *g, *p; > + bool ret = true; > + > + read_lock(&tasklist_lock); > + for_each_process_thread(g, p) { > + if (p != current && !freezer_should_skip(p) && > + !frozen(p)) { > + ret = false; > + goto done; > + } > + } > +done: > + read_unlock(&tasklist_lock); > + > + return ret; > +} > + > /** > * freeze_processes - Signal user space processes to enter the refrigerator. > * The current thread will not be frozen. The same process that calls > @@ -143,25 +165,12 @@ int freeze_processes(void) > * freezing tasks and the killed task might be still > * on the way out so we have to double check for race. > */ > - if (oom_kills_count() != oom_kills_saved) { > - struct task_struct *g, *p; > - > - read_lock(&tasklist_lock); > - for_each_process_thread(g, p) { > - if (p == current || freezer_should_skip(p) || > - frozen(p)) > - continue; > - error = -EBUSY; > - goto out_loop; > - } > -out_loop: > - read_unlock(&tasklist_lock); > - > - if (error) { > - __usermodehelper_set_disable_depth(UMH_ENABLED); > - printk("OOM in progress."); > - goto done; > - } > + if (oom_kills_count() != oom_kills_saved && > + !check_frozen_processes()) { > + __usermodehelper_set_disable_depth(UMH_ENABLED); > + printk("OOM in progress."); > + error = -EBUSY; > + goto done; > } > printk("done."); > } > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 14:41 ` Rafael J. Wysocki @ 2014-10-21 14:29 ` Michal Hocko 2014-10-22 14:39 ` Rafael J. Wysocki ` (2 more replies) 0 siblings, 3 replies; 93+ messages in thread From: Michal Hocko @ 2014-10-21 14:29 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote: > On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote: [...] > > OK, incremental diff on top. I will post the complete patch if you are > > happier with this change > > Yes, I am. --- >From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 20 Oct 2014 18:12:32 +0200 Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 3 +++ kernel/power/process.c | 40 +++++++++++++++++++++++++++++++++++++++- mm/oom_kill.c | 17 +++++++++++++++++ mm/page_alloc.c | 8 ++++++++ 4 files changed, 67 insertions(+), 1 deletion(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 647395a1a550..e8d6e1058723 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -50,6 +50,9 @@ static inline bool oom_task_origin(const struct task_struct *p) extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); + +extern int oom_kills_count(void); +extern void note_oom_kill(void); extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, unsigned int points, unsigned long totalpages, struct mem_cgroup *memcg, nodemask_t *nodemask, diff --git a/kernel/power/process.c b/kernel/power/process.c index 4ee194eb524b..7a37cf3eb1a2 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,6 +108,28 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } +/* + * Returns true if all freezable tasks (except for current) are frozen already + */ +static bool check_frozen_processes(void) +{ + struct task_struct *g, *p; + bool ret = true; + + read_lock(&tasklist_lock); + for_each_process_thread(g, p) { + if (p != current && !freezer_should_skip(p) && + !frozen(p)) { + ret = false; + goto done; + } + } +done: + read_unlock(&tasklist_lock); + + return ret; +} + /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -118,6 +140,7 @@ static int try_to_freeze_tasks(bool user_only) int freeze_processes(void) { int error; + int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -131,12 +154,27 @@ int freeze_processes(void) printk("Freezing user space processes ... "); pm_freezing = true; + oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); if (!error) { - printk("done."); __usermodehelper_set_disable_depth(UMH_DISABLED); oom_killer_disable(); + + /* + * There might have been an OOM kill while we were + * freezing tasks and the killed task might be still + * on the way out so we have to double check for race. + */ + if (oom_kills_count() != oom_kills_saved && + !check_frozen_processes()) { + __usermodehelper_set_disable_depth(UMH_ENABLED); + printk("OOM in progress."); + error = -EBUSY; + goto done; + } + printk("done."); } +done: printk("\n"); BUG_ON(in_atomic()); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index bbf405a3a18f..5340f6b91312 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -404,6 +404,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, dump_tasks(memcg, nodemask); } +/* + * Number of OOM killer invocations (including memcg OOM killer). + * Primarily used by PM freezer to check for potential races with + * OOM killed frozen task. + */ +static atomic_t oom_kills = ATOMIC_INIT(0); + +int oom_kills_count(void) +{ + return atomic_read(&oom_kills); +} + +void note_oom_kill(void) +{ + atomic_inc(&oom_kills); +} + #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cb573b10af12..22f1929469ec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2286,6 +2286,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* + * PM-freezer should be notified that there might be an OOM killer on + * its way to kill and wake somebody up. This is too early and we might + * end up not killing anything but false positives are acceptable. + * See freeze_processes. + */ + note_oom_kill(); + + /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. -- 2.1.1 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 14:29 ` Michal Hocko @ 2014-10-22 14:39 ` Rafael J. Wysocki 2014-10-22 14:22 ` Michal Hocko 2014-10-26 18:49 ` Pavel Machek 2014-11-04 19:27 ` Tejun Heo 2 siblings, 1 reply; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-22 14:39 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 04:29:39 PM Michal Hocko wrote: > On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote: > > On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote: > [...] > > > OK, incremental diff on top. I will post the complete patch if you are > > > happier with this change > > > > Yes, I am. > --- > From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Mon, 20 Oct 2014 18:12:32 +0200 > Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend > > PM freezer relies on having all tasks frozen by the time devices are > getting frozen so that no task will touch them while they are getting > frozen. But OOM killer is allowed to kill an already frozen task in > order to handle OOM situtation. In order to protect from late wake ups > OOM killer is disabled after all tasks are frozen. This, however, still > keeps a window open when a killed task didn't manage to die by the time > freeze_processes finishes. > > Reduce the race window by checking all tasks after OOM killer has been > disabled. This is still not race free completely unfortunately because > oom_killer_disable cannot stop an already ongoing OOM killer so a task > might still wake up from the fridge and get killed without > freeze_processes noticing. Full synchronization of OOM and freezer is, > however, too heavy weight for this highly unlikely case. > > Introduce and check oom_kills counter which gets incremented early when > the allocator enters __alloc_pages_may_oom path and only check all the > tasks if the counter changes during the freezing attempt. The counter > is updated so early to reduce the race window since allocator checked > oom_killer_disabled which is set by PM-freezing code. A false positive > will push the PM-freezer into a slow path but that is not a big deal. > > Changes since v1 > - push the re-check loop out of freeze_processes into > check_frozen_processes and invert the condition to make the code more > readable as per Rafael I've applied that along with the rest of the series, but what about the following cleanup patch on top of it? Rafael --- kernel/power/process.c | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) Index: linux-pm/kernel/power/process.c =================================================================== --- linux-pm.orig/kernel/power/process.c +++ linux-pm/kernel/power/process.c @@ -108,25 +108,27 @@ static int try_to_freeze_tasks(bool user return todo ? -EBUSY : 0; } +static bool __check_frozen_processes(void) +{ + struct task_struct *g, *p; + + for_each_process_thread(g, p) + if (p != current && !freezer_should_skip(p) && !frozen(p)) + return false; + + return true; +} + /* * Returns true if all freezable tasks (except for current) are frozen already */ static bool check_frozen_processes(void) { - struct task_struct *g, *p; - bool ret = true; + bool ret; read_lock(&tasklist_lock); - for_each_process_thread(g, p) { - if (p != current && !freezer_should_skip(p) && - !frozen(p)) { - ret = false; - goto done; - } - } -done: + ret = __check_frozen_processes(); read_unlock(&tasklist_lock); - return ret; } @@ -167,15 +169,14 @@ int freeze_processes(void) * on the way out so we have to double check for race. */ if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { + !check_frozen_processes()) { __usermodehelper_set_disable_depth(UMH_ENABLED); printk("OOM in progress."); error = -EBUSY; - goto done; + } else { + printk("done."); } - printk("done."); } -done: printk("\n"); BUG_ON(in_atomic()); ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-22 14:39 ` Rafael J. Wysocki @ 2014-10-22 14:22 ` Michal Hocko 2014-10-22 21:18 ` Rafael J. Wysocki 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-10-22 14:22 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 22-10-14 16:39:12, Rafael J. Wysocki wrote: > On Tuesday, October 21, 2014 04:29:39 PM Michal Hocko wrote: > > On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote: > > > On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote: > > [...] > > > > OK, incremental diff on top. I will post the complete patch if you are > > > > happier with this change > > > > > > Yes, I am. > > --- > > From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.cz> > > Date: Mon, 20 Oct 2014 18:12:32 +0200 > > Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend > > > > PM freezer relies on having all tasks frozen by the time devices are > > getting frozen so that no task will touch them while they are getting > > frozen. But OOM killer is allowed to kill an already frozen task in > > order to handle OOM situtation. In order to protect from late wake ups > > OOM killer is disabled after all tasks are frozen. This, however, still > > keeps a window open when a killed task didn't manage to die by the time > > freeze_processes finishes. > > > > Reduce the race window by checking all tasks after OOM killer has been > > disabled. This is still not race free completely unfortunately because > > oom_killer_disable cannot stop an already ongoing OOM killer so a task > > might still wake up from the fridge and get killed without > > freeze_processes noticing. Full synchronization of OOM and freezer is, > > however, too heavy weight for this highly unlikely case. > > > > Introduce and check oom_kills counter which gets incremented early when > > the allocator enters __alloc_pages_may_oom path and only check all the > > tasks if the counter changes during the freezing attempt. The counter > > is updated so early to reduce the race window since allocator checked > > oom_killer_disabled which is set by PM-freezing code. A false positive > > will push the PM-freezer into a slow path but that is not a big deal. > > > > Changes since v1 > > - push the re-check loop out of freeze_processes into > > check_frozen_processes and invert the condition to make the code more > > readable as per Rafael > > I've applied that along with the rest of the series, but what about the > following cleanup patch on top of it? Sure, looks good to me. > > Rafael > > > --- > kernel/power/process.c | 31 ++++++++++++++++--------------- > 1 file changed, 16 insertions(+), 15 deletions(-) > > Index: linux-pm/kernel/power/process.c > =================================================================== > --- linux-pm.orig/kernel/power/process.c > +++ linux-pm/kernel/power/process.c > @@ -108,25 +108,27 @@ static int try_to_freeze_tasks(bool user > return todo ? -EBUSY : 0; > } > > +static bool __check_frozen_processes(void) > +{ > + struct task_struct *g, *p; > + > + for_each_process_thread(g, p) > + if (p != current && !freezer_should_skip(p) && !frozen(p)) > + return false; > + > + return true; > +} > + > /* > * Returns true if all freezable tasks (except for current) are frozen already > */ > static bool check_frozen_processes(void) > { > - struct task_struct *g, *p; > - bool ret = true; > + bool ret; > > read_lock(&tasklist_lock); > - for_each_process_thread(g, p) { > - if (p != current && !freezer_should_skip(p) && > - !frozen(p)) { > - ret = false; > - goto done; > - } > - } > -done: > + ret = __check_frozen_processes(); > read_unlock(&tasklist_lock); > - > return ret; > } > > @@ -167,15 +169,14 @@ int freeze_processes(void) > * on the way out so we have to double check for race. > */ > if (oom_kills_count() != oom_kills_saved && > - !check_frozen_processes()) { > + !check_frozen_processes()) { > __usermodehelper_set_disable_depth(UMH_ENABLED); > printk("OOM in progress."); > error = -EBUSY; > - goto done; > + } else { > + printk("done."); > } > - printk("done."); > } > -done: > printk("\n"); > BUG_ON(in_atomic()); > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-22 14:22 ` Michal Hocko @ 2014-10-22 21:18 ` Rafael J. Wysocki 0 siblings, 0 replies; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-22 21:18 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wednesday, October 22, 2014 04:22:26 PM Michal Hocko wrote: > On Wed 22-10-14 16:39:12, Rafael J. Wysocki wrote: > > On Tuesday, October 21, 2014 04:29:39 PM Michal Hocko wrote: > > > On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote: > > > > On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote: > > > [...] > > > > > OK, incremental diff on top. I will post the complete patch if you are > > > > > happier with this change > > > > > > > > Yes, I am. > > > --- > > > From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko <mhocko@suse.cz> > > > Date: Mon, 20 Oct 2014 18:12:32 +0200 > > > Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend > > > > > > PM freezer relies on having all tasks frozen by the time devices are > > > getting frozen so that no task will touch them while they are getting > > > frozen. But OOM killer is allowed to kill an already frozen task in > > > order to handle OOM situtation. In order to protect from late wake ups > > > OOM killer is disabled after all tasks are frozen. This, however, still > > > keeps a window open when a killed task didn't manage to die by the time > > > freeze_processes finishes. > > > > > > Reduce the race window by checking all tasks after OOM killer has been > > > disabled. This is still not race free completely unfortunately because > > > oom_killer_disable cannot stop an already ongoing OOM killer so a task > > > might still wake up from the fridge and get killed without > > > freeze_processes noticing. Full synchronization of OOM and freezer is, > > > however, too heavy weight for this highly unlikely case. > > > > > > Introduce and check oom_kills counter which gets incremented early when > > > the allocator enters __alloc_pages_may_oom path and only check all the > > > tasks if the counter changes during the freezing attempt. The counter > > > is updated so early to reduce the race window since allocator checked > > > oom_killer_disabled which is set by PM-freezing code. A false positive > > > will push the PM-freezer into a slow path but that is not a big deal. > > > > > > Changes since v1 > > > - push the re-check loop out of freeze_processes into > > > check_frozen_processes and invert the condition to make the code more > > > readable as per Rafael > > > > I've applied that along with the rest of the series, but what about the > > following cleanup patch on top of it? > > Sure, looks good to me. I'll apply it then, thanks! -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 14:29 ` Michal Hocko 2014-10-22 14:39 ` Rafael J. Wysocki @ 2014-10-26 18:49 ` Pavel Machek 2014-11-04 19:27 ` Tejun Heo 2 siblings, 0 replies; 93+ messages in thread From: Pavel Machek @ 2014-10-26 18:49 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list Hi! > > +/* > + * Number of OOM killer invocations (including memcg OOM killer). > + * Primarily used by PM freezer to check for potential races with > + * OOM killed frozen task. > + */ > +static atomic_t oom_kills = ATOMIC_INIT(0); > + > +int oom_kills_count(void) > +{ > + return atomic_read(&oom_kills); > +} > + > +void note_oom_kill(void) > +{ > + atomic_inc(&oom_kills); > +} > + Do we need the extra abstraction here? Maybe oom_kills should be exported directly? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 14:29 ` Michal Hocko 2014-10-22 14:39 ` Rafael J. Wysocki 2014-10-26 18:49 ` Pavel Machek @ 2014-11-04 19:27 ` Tejun Heo 2014-11-05 12:46 ` Michal Hocko 2 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-04 19:27 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list Hello, Sorry about the delay. On Tue, Oct 21, 2014 at 04:29:39PM +0200, Michal Hocko wrote: > Reduce the race window by checking all tasks after OOM killer has been Ugh... this is never a good direction to take. It often just ends up making bugs harder to reproduce and track down. > disabled. This is still not race free completely unfortunately because > oom_killer_disable cannot stop an already ongoing OOM killer so a task > might still wake up from the fridge and get killed without > freeze_processes noticing. Full synchronization of OOM and freezer is, > however, too heavy weight for this highly unlikely case. Both oom killing and PM freezing are exremely rare events and I have difficult time why their exclusion would be heavy weight. Care to elaborate? Overall, this is a lot of complexity for something which doesn't really fix the problem and the comments while referring to the race don't mention that the implemented "fix" is broken, which is pretty bad as it gives readers of the code a false sense of security and another hurdle to overcome in actually tracking down what went wrong if this thing ever shows up as an actual breakage. I'd strongly recommend implementing something which is actually correct. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-04 19:27 ` Tejun Heo @ 2014-11-05 12:46 ` Michal Hocko 2014-11-05 13:02 ` Tejun Heo 2014-11-05 14:55 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko 0 siblings, 2 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-05 12:46 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tue 04-11-14 14:27:05, Tejun Heo wrote: > Hello, > > Sorry about the delay. > > On Tue, Oct 21, 2014 at 04:29:39PM +0200, Michal Hocko wrote: > > Reduce the race window by checking all tasks after OOM killer has been > > Ugh... this is never a good direction to take. It often just ends up > making bugs harder to reproduce and track down. As I've said I wasn't entirely happy with this half solution but it helped the current situation at the time. The full solution would require to fully synchronize OOM path with the freezer. The patch below is doing that. > > disabled. This is still not race free completely unfortunately because > > oom_killer_disable cannot stop an already ongoing OOM killer so a task > > might still wake up from the fridge and get killed without > > freeze_processes noticing. Full synchronization of OOM and freezer is, > > however, too heavy weight for this highly unlikely case. > > Both oom killing and PM freezing are exremely rare events and I have > difficult time why their exclusion would be heavy weight. Care to > elaborate You are right that the allocation OOM path is extremely slow and so an additional locking shouldn't matter much. I originally thought that any locking would require more changes in the allocation path. In the end it looks much easier than I hoped. I haven't tested it so I might be just missing some subtle issues now. Anyway I cannot say I would be happy to expose a lock which can block OOM to happen because this calls for troubles. It is true that we already have that ugly oom_killer_disabled hack but that only causes allocation to fail rather than block the OOM path altogether if something goes wrong. Maybe I am just too paranoid... So my original intention was to provide a mechanism which would be safe from OOM point of view and as good as possible from PM POV. The race is really unlikely and even if it happened there would be an OOM message in the log which would give us a hint (I can add a special note that oom is disabled but we are killing a task regardless to make it more obvious if you prefer). > Overall, this is a lot of complexity for something which doesn't > really fix the problem and the comments while referring to the race > don't mention that the implemented "fix" is broken, which is pretty > bad as it gives readers of the code a false sense of security and > another hurdle to overcome in actually tracking down what went wrong > if this thing ever shows up as an actual breakage. The patch description mentions that the race is not closed completely. It is true that the comments in the code could have been more clear about it. > I'd strongly recommend implementing something which is actually > correct. I think the patch below should be safe. Would you prefer this solution instead? It is race free but there is the risk that exposing a lock which completely blocks OOM killer from the allocation path will kick us later. --- >From ef6227565fa65b52986c4626d49ba53b499e54d1 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 5 Nov 2014 11:49:14 +0100 Subject: [PATCH] OOM, PM: make OOM detection in the freezer path raceless 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer exclusion, though. This is done by this patch which introduces oom_sem RW lock. Page allocation OOM path takes the lock for reading because there might be concurrent OOM happening on disjunct zonelists. oom_killer_disabled check is moved right before out_of_memory is called because it was checked too early before and we do not want to hold the lock while doing the last attempt for allocation which might involve zone_reclaim. freeze_processes then takes the lock for write throughout the whole freezing process and OOM disabling. There is no need to recheck all the processes with the full synchronization anymore. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 5 +++++ kernel/power/process.c | 50 +++++++++----------------------------------------- mm/oom_kill.c | 17 ----------------- mm/page_alloc.c | 24 ++++++++++++------------ 4 files changed, 26 insertions(+), 70 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index e8d6e1058723..350b9b2ffeec 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -73,7 +73,12 @@ extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); +/* + * oom_killer_disabled can be modified only under oom_sem taken for write + * and checked under read lock along with the full OOM handler. + */ extern bool oom_killer_disabled; +extern struct rw_semaphore oom_sem; static inline void oom_killer_disable(void) { diff --git a/kernel/power/process.c b/kernel/power/process.c index 5a6ec8678b9a..befce9785233 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } -static bool __check_frozen_processes(void) -{ - struct task_struct *g, *p; - - for_each_process_thread(g, p) - if (p != current && !freezer_should_skip(p) && !frozen(p)) - return false; - - return true; -} - -/* - * Returns true if all freezable tasks (except for current) are frozen already - */ -static bool check_frozen_processes(void) -{ - bool ret; - - read_lock(&tasklist_lock); - ret = __check_frozen_processes(); - read_unlock(&tasklist_lock); - return ret; -} - /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) int freeze_processes(void) { int error; - int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -157,27 +132,20 @@ int freeze_processes(void) pm_wakeup_clear(); printk("Freezing user space processes ... "); pm_freezing = true; - oom_kills_saved = oom_kills_count(); + + /* + * Need to exlude OOM killer from triggering while tasks are + * getting frozen to make sure none of them gets killed after + * try_to_freeze_tasks is done. + */ + down_write(&oom_sem); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); oom_killer_disable(); - - /* - * There might have been an OOM kill while we were - * freezing tasks and the killed task might be still - * on the way out so we have to double check for race. - */ - if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); - error = -EBUSY; - } else { - printk("done."); - } + printk("done.\n"); } - printk("\n"); + up_write(&oom_sem); BUG_ON(in_atomic()); if (error) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5340f6b91312..bbf405a3a18f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, dump_tasks(memcg, nodemask); } -/* - * Number of OOM killer invocations (including memcg OOM killer). - * Primarily used by PM freezer to check for potential races with - * OOM killed frozen task. - */ -static atomic_t oom_kills = ATOMIC_INIT(0); - -int oom_kills_count(void) -{ - return atomic_read(&oom_kills); -} - -void note_oom_kill(void) -{ - atomic_inc(&oom_kills); -} - #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9cd36b822444..76095266c4b5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -243,6 +243,7 @@ void set_pageblock_migratetype(struct page *page, int migratetype) } bool oom_killer_disabled __read_mostly; +DECLARE_RWSEM(oom_sem); #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) @@ -2252,14 +2253,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* - * PM-freezer should be notified that there might be an OOM killer on - * its way to kill and wake somebody up. This is too early and we might - * end up not killing anything but false positives are acceptable. - * See freeze_processes. - */ - note_oom_kill(); - - /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. @@ -2288,8 +2281,17 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, if (gfp_mask & __GFP_THISNODE) goto out; } - /* Exhausted what can be done so it's blamo time */ - out_of_memory(zonelist, gfp_mask, order, nodemask, false); + + /* + * Exhausted what can be done so it's blamo time. + * Just make sure that we cannot race with oom_killer disabling + * e.g. PM freezer needs to make sure that no OOM happens after + * all tasks are frozen. + */ + down_read(&oom_sem); + if (!oom_killer_disabled) + out_of_memory(zonelist, gfp_mask, order, nodemask, false); + up_read(&oom_sem); out: oom_zonelist_unlock(zonelist, gfp_mask); @@ -2716,8 +2718,6 @@ rebalance: */ if (!did_some_progress) { if (oom_gfp_allowed(gfp_mask)) { - if (oom_killer_disabled) - goto nopage; /* Coredumps can quickly deplete all memory reserves */ if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL)) -- 2.1.1 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 12:46 ` Michal Hocko @ 2014-11-05 13:02 ` Tejun Heo 2014-11-05 13:31 ` Michal Hocko 2014-11-05 14:55 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko 1 sibling, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-05 13:02 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list Hello, Michal. On Wed, Nov 05, 2014 at 01:46:20PM +0100, Michal Hocko wrote: > As I've said I wasn't entirely happy with this half solution but it helped > the current situation at the time. The full solution would require to I don't think this helps the situation. It just makes the bug more obscure and the race window while reduced is still pretty big and there seems to be an actual not too low chance of the bug triggering out in the wild. How does this level of obscuring help anything? In addition to making the bug more difficult to reproduce, it also adds a bunch of code which *pretends* to address the issue but ultimately just lowers visibility into what's going on and hinders tracking down the issue when something actually goes wrong. This is *NOT* making the situation better. The patch is net negative. > I think the patch below should be safe. Would you prefer this solution > instead? It is race free but there is the risk that exposing a lock which Yes, this is an a lot saner approach in general. > completely blocks OOM killer from the allocation path will kick us > later. Can you please spell it out? How would it kick us? We already have oom_killer_disable/enable(), how is this any different in terms of correctness from them? Also, why isn't this part of oom_killer_disable/enable()? The way they're implemented is really silly now. It just sets a flag and returns whether there's a currently running instance or not. How were these even useful? Why can't you just make disable/enable to what they were supposed to do from the beginning? Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 13:02 ` Tejun Heo @ 2014-11-05 13:31 ` Michal Hocko 2014-11-05 13:42 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-05 13:31 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 08:02:47, Tejun Heo wrote: > Hello, Michal. > > On Wed, Nov 05, 2014 at 01:46:20PM +0100, Michal Hocko wrote: > > As I've said I wasn't entirely happy with this half solution but it helped > > the current situation at the time. The full solution would require to > > I don't think this helps the situation. It just makes the bug more > obscure and the race window while reduced is still pretty big and > there seems to be an actual not too low chance of the bug triggering > out in the wild. How does this level of obscuring help anything? In > addition to making the bug more difficult to reproduce, it also adds a > bunch of code which *pretends* to address the issue but ultimately > just lowers visibility into what's going on and hinders tracking down > the issue when something actually goes wrong. This is *NOT* making > the situation better. The patch is net negative. The patch was a compromise. It was needed to catch the most common OOM conditions while the tasks are getting frozen. The race window between the counter increment and the check in the PM path is negligible compared to the freezing process. And it is safe from OOM point of view because nothing can block it away. > > I think the patch below should be safe. Would you prefer this solution > > instead? It is race free but there is the risk that exposing a lock which > > Yes, this is an a lot saner approach in general. > > > completely blocks OOM killer from the allocation path will kick us > > later. > > Can you please spell it out? How would it kick us? We already have > oom_killer_disable/enable(), how is this any different in terms of > correctness from them? As already said in the part of email you haven't quoted. oom_killer_disable will cause allocations to _fail_. With the lock you are _blocking_ OOM killer completely. This is error prone because no part of system should be able to block the last resort memory shortage actions. > Also, why isn't this part of > oom_killer_disable/enable()? The way they're implemented is really > silly now. It just sets a flag and returns whether there's a > currently running instance or not. How were these even useful? > Why can't you just make disable/enable to what they were supposed to > do from the beginning? Because then we would block all the potential allocators coming from workqueues or kernel threads which are not frozen yet rather than fail the allocation. I am not familiar with the PM code and all the paths this might get called from enough to tell whether failing the allocation is better approach than failing the suspend operation on a timeout. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 13:31 ` Michal Hocko @ 2014-11-05 13:42 ` Michal Hocko 2014-11-05 14:14 ` Michal Hocko 2014-11-05 15:44 ` Tejun Heo 0 siblings, 2 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-05 13:42 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 14:31:00, Michal Hocko wrote: > On Wed 05-11-14 08:02:47, Tejun Heo wrote: [...] > > Also, why isn't this part of > > oom_killer_disable/enable()? The way they're implemented is really > > silly now. It just sets a flag and returns whether there's a > > currently running instance or not. How were these even useful? > > Why can't you just make disable/enable to what they were supposed to > > do from the beginning? > > Because then we would block all the potential allocators coming from > workqueues or kernel threads which are not frozen yet rather than fail > the allocation. After thinking about this more it would be doable by using trylock in the allocation oom path. I will respin the patch. The API will be cleaner this way. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 13:42 ` Michal Hocko @ 2014-11-05 14:14 ` Michal Hocko 2014-11-05 15:45 ` Michal Hocko 2014-11-05 15:44 ` Tejun Heo 1 sibling, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-05 14:14 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 14:42:19, Michal Hocko wrote: > On Wed 05-11-14 14:31:00, Michal Hocko wrote: > > On Wed 05-11-14 08:02:47, Tejun Heo wrote: > [...] > > > Also, why isn't this part of > > > oom_killer_disable/enable()? The way they're implemented is really > > > silly now. It just sets a flag and returns whether there's a > > > currently running instance or not. How were these even useful? > > > Why can't you just make disable/enable to what they were supposed to > > > do from the beginning? > > > > Because then we would block all the potential allocators coming from > > workqueues or kernel threads which are not frozen yet rather than fail > > the allocation. > > After thinking about this more it would be doable by using trylock in > the allocation oom path. I will respin the patch. The API will be > cleaner this way. --- >From 33654faeea161ef9a411f9ff6d84419712bb4a0f Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 5 Nov 2014 15:09:56 +0100 Subject: [PATCH] OOM, PM: make OOM detection in the freezer path raceless 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer exclusion, though. This is done by this patch which introduces oom_sem RW lock and gets rid of oom_killer_disabled global flag. The PM code uses oom_killer_{disable,enable} which takes the lock for write and exclude all the OOM killer invocation from the page allocation path. The allocation path uses oom_killer_allowed_{begin,end} around __alloc_pages_may_oom call. This is implemented by a read trylock so all the concurrent OOM killers (operating on different zonlists) are allowed to proceed unless OOM is disabled when the allocation simply fails. There is no need to recheck all the processes with the full synchronization anymore. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 33 ++++++++++++++++++++++++--------- kernel/power/process.c | 50 ++++++++------------------------------------------ mm/oom_kill.c | 39 ++++++++++++++++++++++----------------- mm/page_alloc.c | 21 +++++++++------------ 4 files changed, 63 insertions(+), 80 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index e8d6e1058723..850f7f653eb7 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -73,17 +73,32 @@ extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); -extern bool oom_killer_disabled; +/** + * oom_killer_disable - disable OOM killer in page allocator + * + * Forces all page allocations to fail rather than trigger OOM killer. + */ +extern void oom_killer_disable(void); -static inline void oom_killer_disable(void) -{ - oom_killer_disabled = true; -} +/** + * oom_killer_enable - enable OOM killer + */ +extern void oom_killer_enable(void); -static inline void oom_killer_enable(void) -{ - oom_killer_disabled = false; -} +/** + * oom_killer_allowed_start - start OOM killer section + * + * Synchronise with oom_killer_{disable,enable} sections. + * Returns 1 if oom_killer is allowed. + */ +extern int oom_killer_allowed_start(void); + +/** + * oom_killer_allowed_end - end OOM killer section + * + * previously started by oom_killer_allowed_end. + */ +extern void oom_killer_allowed_end(void); static inline bool oom_gfp_allowed(gfp_t gfp_mask) { diff --git a/kernel/power/process.c b/kernel/power/process.c index 5a6ec8678b9a..7d08d56cbf3f 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } -static bool __check_frozen_processes(void) -{ - struct task_struct *g, *p; - - for_each_process_thread(g, p) - if (p != current && !freezer_should_skip(p) && !frozen(p)) - return false; - - return true; -} - -/* - * Returns true if all freezable tasks (except for current) are frozen already - */ -static bool check_frozen_processes(void) -{ - bool ret; - - read_lock(&tasklist_lock); - ret = __check_frozen_processes(); - read_unlock(&tasklist_lock); - return ret; -} - /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) int freeze_processes(void) { int error; - int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -157,27 +132,18 @@ int freeze_processes(void) pm_wakeup_clear(); printk("Freezing user space processes ... "); pm_freezing = true; - oom_kills_saved = oom_kills_count(); + + /* + * Need to exlude OOM killer from triggering while tasks are + * getting frozen to make sure none of them gets killed after + * try_to_freeze_tasks is done. + */ + oom_killer_disable(); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); - oom_killer_disable(); - - /* - * There might have been an OOM kill while we were - * freezing tasks and the killed task might be still - * on the way out so we have to double check for race. - */ - if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); - error = -EBUSY; - } else { - printk("done."); - } + printk("done.\n"); } - printk("\n"); BUG_ON(in_atomic()); if (error) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5340f6b91312..7fc75b4df837 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, dump_tasks(memcg, nodemask); } -/* - * Number of OOM killer invocations (including memcg OOM killer). - * Primarily used by PM freezer to check for potential races with - * OOM killed frozen task. - */ -static atomic_t oom_kills = ATOMIC_INIT(0); - -int oom_kills_count(void) -{ - return atomic_read(&oom_kills); -} - -void note_oom_kill(void) -{ - atomic_inc(&oom_kills); -} - #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon @@ -615,6 +598,28 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) spin_unlock(&zone_scan_lock); } +static DECLARE_RWSEM(oom_sem); + +void oom_killer_disabled(void) +{ + down_write(&oom_sem); +} + +void oom_killer_enable(void) +{ + up_write(&oom_sem); +} + +int oom_killer_allowed_start(void) +{ + return down_read_trylock(&oom_sem); +} + +void oom_killer_allowed_end(void) +{ + up_read(&oom_sem); +} + /** * out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9cd36b822444..206ce46ce975 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype) PB_migrate, PB_migrate_end); } -bool oom_killer_disabled __read_mostly; - #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { @@ -2252,14 +2250,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* - * PM-freezer should be notified that there might be an OOM killer on - * its way to kill and wake somebody up. This is too early and we might - * end up not killing anything but false positives are acceptable. - * See freeze_processes. - */ - note_oom_kill(); - - /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. @@ -2716,16 +2706,23 @@ rebalance: */ if (!did_some_progress) { if (oom_gfp_allowed(gfp_mask)) { - if (oom_killer_disabled) - goto nopage; /* Coredumps can quickly deplete all memory reserves */ if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL)) goto nopage; + /* + * Just make sure that we cannot race with oom_killer + * disabling e.g. PM freezer needs to make sure that + * no OOM happens after all tasks are frozen. + */ + if (!oom_killer_allowed_start()) + goto nopage; page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, classzone_idx, migratetype); + oom_killer_allowed_end(); + if (page) goto got_pg; -- 2.1.1 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 14:14 ` Michal Hocko @ 2014-11-05 15:45 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-05 15:45 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list Ups, just noticed that I have a compile fix staged which didn't make it into git format-patch. Will repost after/if you are OK with this approach. But I guess this is much better outcome. Thanks for pushing Tejun! On Wed 05-11-14 15:14:58, Michal Hocko wrote: [...] > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 5340f6b91312..7fc75b4df837 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c [...] > @@ -615,6 +598,28 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) > spin_unlock(&zone_scan_lock); > } > > +static DECLARE_RWSEM(oom_sem); > + > +void oom_killer_disabled(void) Should be oom_killer_disable(void) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 13:42 ` Michal Hocko 2014-11-05 14:14 ` Michal Hocko @ 2014-11-05 15:44 ` Tejun Heo 2014-11-05 16:01 ` Michal Hocko 1 sibling, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-05 15:44 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed, Nov 05, 2014 at 02:42:19PM +0100, Michal Hocko wrote: > On Wed 05-11-14 14:31:00, Michal Hocko wrote: > > On Wed 05-11-14 08:02:47, Tejun Heo wrote: > [...] > > > Also, why isn't this part of > > > oom_killer_disable/enable()? The way they're implemented is really > > > silly now. It just sets a flag and returns whether there's a > > > currently running instance or not. How were these even useful? > > > Why can't you just make disable/enable to what they were supposed to > > > do from the beginning? > > > > Because then we would block all the potential allocators coming from > > workqueues or kernel threads which are not frozen yet rather than fail > > the allocation. > > After thinking about this more it would be doable by using trylock in > the allocation oom path. I will respin the patch. The API will be > cleaner this way. In disable, block new invocations of OOM killer and then drain the in-progress ones. This is a common pattern, isn't it? -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 15:44 ` Tejun Heo @ 2014-11-05 16:01 ` Michal Hocko 2014-11-05 16:29 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-05 16:01 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 10:44:36, Tejun Heo wrote: > On Wed, Nov 05, 2014 at 02:42:19PM +0100, Michal Hocko wrote: > > On Wed 05-11-14 14:31:00, Michal Hocko wrote: > > > On Wed 05-11-14 08:02:47, Tejun Heo wrote: > > [...] > > > > Also, why isn't this part of > > > > oom_killer_disable/enable()? The way they're implemented is really > > > > silly now. It just sets a flag and returns whether there's a > > > > currently running instance or not. How were these even useful? > > > > Why can't you just make disable/enable to what they were supposed to > > > > do from the beginning? > > > > > > Because then we would block all the potential allocators coming from > > > workqueues or kernel threads which are not frozen yet rather than fail > > > the allocation. > > > > After thinking about this more it would be doable by using trylock in > > the allocation oom path. I will respin the patch. The API will be > > cleaner this way. > > In disable, block new invocations of OOM killer and then drain the > in-progress ones. This is a common pattern, isn't it? I am not sure I am following. With the latest patch OOM path is no longer blocked by the PM (aka oom_killer_disable()). Allocations simply fail if the read_trylock fails. oom_killer_disable is moved before tasks are frozen and it will wait for all on-going OOM killers on the write lock. OOM killer is enabled again on the resume path. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 16:01 ` Michal Hocko @ 2014-11-05 16:29 ` Tejun Heo 2014-11-05 16:39 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-05 16:29 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list Hello, Michal. On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote: > I am not sure I am following. With the latest patch OOM path is no > longer blocked by the PM (aka oom_killer_disable()). Allocations simply > fail if the read_trylock fails. > oom_killer_disable is moved before tasks are frozen and it will wait for > all on-going OOM killers on the write lock. OOM killer is enabled again > on the resume path. Sure, but why are we exposing new interfaces? Can't we just make oom_killer_disable() first set the disable flag and wait for the on-going ones to finish (and make the function fail if it gets chosen as an OOM victim)? It's weird to expose extra stuff on top. Why are we doing that? Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 16:29 ` Tejun Heo @ 2014-11-05 16:39 ` Michal Hocko 2014-11-05 16:54 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-05 16:39 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 11:29:29, Tejun Heo wrote: > Hello, Michal. > > On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote: > > I am not sure I am following. With the latest patch OOM path is no > > longer blocked by the PM (aka oom_killer_disable()). Allocations simply > > fail if the read_trylock fails. > > oom_killer_disable is moved before tasks are frozen and it will wait for > > all on-going OOM killers on the write lock. OOM killer is enabled again > > on the resume path. > > Sure, but why are we exposing new interfaces? Can't we just make > oom_killer_disable() first set the disable flag and wait for the > on-going ones to finish (and make the function fail if it gets chosen > as an OOM victim)? Still not following. How do you want to detect an on-going OOM without any interface around out_of_memory? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 16:39 ` Michal Hocko @ 2014-11-05 16:54 ` Tejun Heo 2014-11-05 17:01 ` Tejun Heo 2014-11-05 17:46 ` Michal Hocko 0 siblings, 2 replies; 93+ messages in thread From: Tejun Heo @ 2014-11-05 16:54 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed, Nov 05, 2014 at 05:39:56PM +0100, Michal Hocko wrote: > On Wed 05-11-14 11:29:29, Tejun Heo wrote: > > Hello, Michal. > > > > On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote: > > > I am not sure I am following. With the latest patch OOM path is no > > > longer blocked by the PM (aka oom_killer_disable()). Allocations simply > > > fail if the read_trylock fails. > > > oom_killer_disable is moved before tasks are frozen and it will wait for > > > all on-going OOM killers on the write lock. OOM killer is enabled again > > > on the resume path. > > > > Sure, but why are we exposing new interfaces? Can't we just make > > oom_killer_disable() first set the disable flag and wait for the > > on-going ones to finish (and make the function fail if it gets chosen > > as an OOM victim)? > > Still not following. How do you want to detect an on-going OOM without > any interface around out_of_memory? I thought you were using oom_killer_allowed_start() outside OOM path. Ugh.... why is everything weirdly structured? oom_killer_disabled implies that oom killer may fail, right? Why is __alloc_pages_slowpath() checking it directly? If whether oom killing failed or not is relevant to its users, make out_of_memory() return an error code. There's no reason for the exclusion detail to leak out of the oom killer proper. The only interface should be disable/enable and whether oom killing failed or not. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 16:54 ` Tejun Heo @ 2014-11-05 17:01 ` Tejun Heo 2014-11-06 13:05 ` Michal Hocko 2014-11-05 17:46 ` Michal Hocko 1 sibling, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-05 17:01 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed, Nov 05, 2014 at 11:54:28AM -0500, Tejun Heo wrote: > > Still not following. How do you want to detect an on-going OOM without > > any interface around out_of_memory? > > I thought you were using oom_killer_allowed_start() outside OOM path. > Ugh.... why is everything weirdly structured? oom_killer_disabled > implies that oom killer may fail, right? Why is > __alloc_pages_slowpath() checking it directly? If whether oom killing > failed or not is relevant to its users, make out_of_memory() return an > error code. There's no reason for the exclusion detail to leak out of > the oom killer proper. The only interface should be disable/enable > and whether oom killing failed or not. And what's implemented is wrong. What happens if oom killing is already in progress and then a task blocks trying to write-lock the rwsem and then that task is selected as the OOM victim? disable() call must be able to fail. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 17:01 ` Tejun Heo @ 2014-11-06 13:05 ` Michal Hocko 2014-11-06 15:09 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-06 13:05 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 12:01:11, Tejun Heo wrote: > On Wed, Nov 05, 2014 at 11:54:28AM -0500, Tejun Heo wrote: > > > Still not following. How do you want to detect an on-going OOM without > > > any interface around out_of_memory? > > > > I thought you were using oom_killer_allowed_start() outside OOM path. > > Ugh.... why is everything weirdly structured? oom_killer_disabled > > implies that oom killer may fail, right? Why is > > __alloc_pages_slowpath() checking it directly? If whether oom killing > > failed or not is relevant to its users, make out_of_memory() return an > > error code. There's no reason for the exclusion detail to leak out of > > the oom killer proper. The only interface should be disable/enable > > and whether oom killing failed or not. > > And what's implemented is wrong. What happens if oom killing is > already in progress and then a task blocks trying to write-lock the > rwsem and then that task is selected as the OOM victim? But this is nothing new. Suspend hasn't been checking for fatal signals nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose even before. This is not harmful though. The previous OOM kill attempt would kick the current TASK and mark it with TIF_MEMDIE and retry the allocation. After OOM is disabled the allocation simply fails. The current will die on its way out of the kernel. Definitely worth fixing. In a separate patch. > disable() call must be able to fail. This would be a way to do it without requiring caller to check for TIF_MEMDIE explicitly. The fewer of them we have the better. --- >From 3a7e18144a369bfc537c1cda4c7c2c548e9114b8 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Thu, 6 Nov 2014 11:51:34 +0100 Subject: [PATCH] OOM, PM: handle pm freezer as an OOM victim correctly PM freezer doesn't check whether it has been killed by OOM killer after it disables OOM killer which means that it continues with the suspend even though it should die as soon as possible. This has been the case ever since PM suspend disables OOM killer and I suppose it has ignored OOM even before. This is not harmful though. The allocation which triggers OOM will retry the allocation after a process is killed and the next attempt will fail because the OOM killer will be disabled at the time so there is no risk of an endless loop because the OOM victim doesn't die. But this is a correctness issue because no task should ignore OOM. As suggested by Tejun, oom_killer_disable will return a success status now. If the current task is pending fatal signals or TIF_MEMDIE is set after oom_sem is taken then the caller should bail out and this is what freeze_processes does with this patch. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 4 +++- kernel/power/process.c | 16 ++++++++++------ mm/oom_kill.c | 12 +++++++++++- 3 files changed, 24 insertions(+), 8 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 4af99a9b543b..a978bf2b02a1 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -77,8 +77,10 @@ extern int unregister_oom_notifier(struct notifier_block *nb); * oom_killer_disable - disable OOM killer in page allocator * * Forces all page allocations to fail rather than trigger OOM killer. + * Returns true on success and fails if the OOM killer couldn't be + * disabled (e.g. because the current task has been killed before) */ -extern void oom_killer_disable(void); +extern bool oom_killer_disable(void); /** * oom_killer_enable - enable OOM killer diff --git a/kernel/power/process.c b/kernel/power/process.c index 7d08d56cbf3f..0f8b782f9215 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -123,6 +123,16 @@ int freeze_processes(void) if (error) return error; + /* + * Need to exlude OOM killer from triggering while tasks are + * getting frozen to make sure none of them gets killed after + * try_to_freeze_tasks is done. + */ + if (!oom_killer_disable()) { + usermodehelper_enable(); + return -EBUSY; + } + /* Make sure this task doesn't get frozen */ current->flags |= PF_SUSPEND_TASK; @@ -133,12 +143,6 @@ int freeze_processes(void) printk("Freezing user space processes ... "); pm_freezing = true; - /* - * Need to exlude OOM killer from triggering while tasks are - * getting frozen to make sure none of them gets killed after - * try_to_freeze_tasks is done. - */ - oom_killer_disable(); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index f80c5b777f05..58ade54ee421 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -600,9 +600,19 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) static DECLARE_RWSEM(oom_sem); -void oom_killer_disable(void) +bool oom_killer_disable(void) { + bool ret = true; + down_write(&oom_sem); + + /* We might have been killed while waiting for the oom_sem. */ + if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) { + up_write(&oom_sem); + ret = false; + } + + return ret; } void oom_killer_enable(void) -- 2.1.1 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 13:05 ` Michal Hocko @ 2014-11-06 15:09 ` Tejun Heo 2014-11-06 16:01 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-06 15:09 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu, Nov 06, 2014 at 02:05:43PM +0100, Michal Hocko wrote: > But this is nothing new. Suspend hasn't been checking for fatal signals > nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose > even before. > > This is not harmful though. The previous OOM kill attempt would kick the > current TASK and mark it with TIF_MEMDIE and retry the allocation. After > OOM is disabled the allocation simply fails. The current will die on its > way out of the kernel. Definitely worth fixing. In a separate patch. Hah? Isn't this a new outright A-B B-A deadlock involving the rwsem you added? > > disable() call must be able to fail. > > This would be a way to do it without requiring caller to check for > TIF_MEMDIE explicitly. The fewer of them we have the better. Why the hell would the caller ever even KNOW about this? This is something which must be encapsulated in the OOM killer disable/enable interface. > +bool oom_killer_disable(void) > { > + bool ret = true; > + > down_write(&oom_sem); How would this task pass the above down_write() if the OOM killer is already read locking oom_sem? Or is the OOM killer guaranteed to make forward progress even if the killed task can't make forward progress? But, if so, what are we talking about in this thread? > + > + /* We might have been killed while waiting for the oom_sem. */ > + if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) { > + up_write(&oom_sem); > + ret = false; > + } This is pointless. What does the above do? -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 15:09 ` Tejun Heo @ 2014-11-06 16:01 ` Michal Hocko 2014-11-06 16:12 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-06 16:01 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu 06-11-14 10:09:27, Tejun Heo wrote: > On Thu, Nov 06, 2014 at 02:05:43PM +0100, Michal Hocko wrote: > > But this is nothing new. Suspend hasn't been checking for fatal signals > > nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose > > even before. > > > > This is not harmful though. The previous OOM kill attempt would kick the > > current TASK and mark it with TIF_MEMDIE and retry the allocation. After > > OOM is disabled the allocation simply fails. The current will die on its > > way out of the kernel. Definitely worth fixing. In a separate patch. > > Hah? Isn't this a new outright A-B B-A deadlock involving the rwsem > you added? No, see below. > > > disable() call must be able to fail. > > > > This would be a way to do it without requiring caller to check for > > TIF_MEMDIE explicitly. The fewer of them we have the better. > > Why the hell would the caller ever even KNOW about this? This is > something which must be encapsulated in the OOM killer disable/enable > interface. > > > +bool oom_killer_disable(void) > > { > > + bool ret = true; > > + > > down_write(&oom_sem); > > How would this task pass the above down_write() if the OOM killer is > already read locking oom_sem? Or is the OOM killer guaranteed to make > forward progress even if the killed task can't make forward progress? > But, if so, what are we talking about in this thread? Yes, OOM killer simply kicks the process sets TIF_MEMDIE and terminates. That will release the read_lock, allow this to take the write lock and check whether it the current has been killed without any races. OOM killer doesn't wait for the killed task. The allocation is retried. Does this explain your concern? [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 16:01 ` Michal Hocko @ 2014-11-06 16:12 ` Tejun Heo 2014-11-06 16:31 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-06 16:12 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote: > Yes, OOM killer simply kicks the process sets TIF_MEMDIE and terminates. > That will release the read_lock, allow this to take the write lock and > check whether it the current has been killed without any races. > OOM killer doesn't wait for the killed task. The allocation is retried. > > Does this explain your concern? Draining oom killer then doesn't mean anything, no? OOM killer may have been disabled and drained but the killed tasks might wake up after the PM freezer considers them to be frozen, right? What am I missing? -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 16:12 ` Tejun Heo @ 2014-11-06 16:31 ` Michal Hocko 2014-11-06 16:33 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-06 16:31 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu 06-11-14 11:12:11, Tejun Heo wrote: > On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote: > > Yes, OOM killer simply kicks the process sets TIF_MEMDIE and terminates. > > That will release the read_lock, allow this to take the write lock and > > check whether it the current has been killed without any races. > > OOM killer doesn't wait for the killed task. The allocation is retried. > > > > Does this explain your concern? > > Draining oom killer then doesn't mean anything, no? OOM killer may > have been disabled and drained but the killed tasks might wake up > after the PM freezer considers them to be frozen, right? What am I > missing? The mutual exclusion between OOM and the freezer will cause that the victim will have TIF_MEMDIE already set when try_to_freeze_tasks even starts. Then freezing_slow_path wouldn't allow the task to enter the fridge so the wake up moment is not really that important. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 16:31 ` Michal Hocko @ 2014-11-06 16:33 ` Tejun Heo 2014-11-06 16:58 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-06 16:33 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu, Nov 06, 2014 at 05:31:24PM +0100, Michal Hocko wrote: > On Thu 06-11-14 11:12:11, Tejun Heo wrote: > > On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote: > > > Yes, OOM killer simply kicks the process sets TIF_MEMDIE and terminates. > > > That will release the read_lock, allow this to take the write lock and > > > check whether it the current has been killed without any races. > > > OOM killer doesn't wait for the killed task. The allocation is retried. > > > > > > Does this explain your concern? > > > > Draining oom killer then doesn't mean anything, no? OOM killer may > > have been disabled and drained but the killed tasks might wake up > > after the PM freezer considers them to be frozen, right? What am I > > missing? > > The mutual exclusion between OOM and the freezer will cause that the > victim will have TIF_MEMDIE already set when try_to_freeze_tasks even > starts. Then freezing_slow_path wouldn't allow the task to enter the > fridge so the wake up moment is not really that important. What if it was already in the freezer? -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 16:33 ` Tejun Heo @ 2014-11-06 16:58 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-06 16:58 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu 06-11-14 11:33:04, Tejun Heo wrote: > On Thu, Nov 06, 2014 at 05:31:24PM +0100, Michal Hocko wrote: > > On Thu 06-11-14 11:12:11, Tejun Heo wrote: [...] > > > Draining oom killer then doesn't mean anything, no? OOM killer may > > > have been disabled and drained but the killed tasks might wake up > > > after the PM freezer considers them to be frozen, right? What am I > > > missing? > > > > The mutual exclusion between OOM and the freezer will cause that the > > victim will have TIF_MEMDIE already set when try_to_freeze_tasks even > > starts. Then freezing_slow_path wouldn't allow the task to enter the > > fridge so the wake up moment is not really that important. > > What if it was already in the freezer? Good question! You are right that there is a race window until the wake up then. I will think about this case some more. There is simply no control on when the task wakes up and freezer will see it as frozen until then. An immediate way around would be to check for TIF_MEMDIE in try_to_freeze_tasks. I have to call it end of the day unfortunately and will be back on Monday. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 16:54 ` Tejun Heo 2014-11-05 17:01 ` Tejun Heo @ 2014-11-05 17:46 ` Michal Hocko 2014-11-05 17:55 ` Tejun Heo 1 sibling, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-05 17:46 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 11:54:28, Tejun Heo wrote: > On Wed, Nov 05, 2014 at 05:39:56PM +0100, Michal Hocko wrote: > > On Wed 05-11-14 11:29:29, Tejun Heo wrote: > > > Hello, Michal. > > > > > > On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote: > > > > I am not sure I am following. With the latest patch OOM path is no > > > > longer blocked by the PM (aka oom_killer_disable()). Allocations simply > > > > fail if the read_trylock fails. > > > > oom_killer_disable is moved before tasks are frozen and it will wait for > > > > all on-going OOM killers on the write lock. OOM killer is enabled again > > > > on the resume path. > > > > > > Sure, but why are we exposing new interfaces? Can't we just make > > > oom_killer_disable() first set the disable flag and wait for the > > > on-going ones to finish (and make the function fail if it gets chosen > > > as an OOM victim)? > > > > Still not following. How do you want to detect an on-going OOM without > > any interface around out_of_memory? > > I thought you were using oom_killer_allowed_start() outside OOM path. > Ugh.... why is everything weirdly structured? oom_killer_disabled > implies that oom killer may fail, right? Why is > __alloc_pages_slowpath() checking it directly? Because out_of_memory can be called from mutliple paths. And the only interesting one should be the page allocation path. pagefault_out_of_memory is not interesting because it cannot happen for the frozen task. Now that I am looking maybe even sysrq OOM trigger should as well. > If whether oom killing failed or not is relevant to its users, make > out_of_memory() return an error code. There's no reason for the > exclusion detail to leak out of the oom killer proper. The only > interface should be disable/enable and whether oom killing failed or > not. Got your point. I can reshuffle the code and make the trylock thingy inside oom_kill.c. I am not sure it is so much better because the OOM knowledge is already spread (e.g. check oom_zonelist_trylock outside of out_of_memory or even oom_gfp_allowed before we enter__alloc_pages_may_oom). Anyway, I do not care much and I am OK with your return code convention as the only other way how OOM might fail is when there is no victim and we panic then. Something like (even not compile tested) --- diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 42bad18c66c9..14f3d7fd961f 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { static void moom_callback(struct work_struct *ignored) { - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, - 0, NULL, true); + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), + GFP_KERNEL, 0, NULL, true)) { + printk(KERN_INFO "OOM killer disabled\n"); + } } static DECLARE_WORK(moom_work, moom_callback); diff --git a/include/linux/oom.h b/include/linux/oom.h index 850f7f653eb7..4af99a9b543b 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -68,7 +68,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); @@ -85,21 +85,6 @@ extern void oom_killer_disable(void); */ extern void oom_killer_enable(void); -/** - * oom_killer_allowed_start - start OOM killer section - * - * Synchronise with oom_killer_{disable,enable} sections. - * Returns 1 if oom_killer is allowed. - */ -extern int oom_killer_allowed_start(void); - -/** - * oom_killer_allowed_end - end OOM killer section - * - * previously started by oom_killer_allowed_end. - */ -extern void oom_killer_allowed_end(void); - static inline bool oom_gfp_allowed(gfp_t gfp_mask) { return (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 126e7da17cf9..3e136a2c0b1f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -610,18 +610,8 @@ void oom_killer_enable(void) up_write(&oom_sem); } -int oom_killer_allowed_start(void) -{ - return down_read_trylock(&oom_sem); -} - -void oom_killer_allowed_end(void) -{ - up_read(&oom_sem); -} - /** - * out_of_memory - kill the "best" process when we run out of memory + * __out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 @@ -633,7 +623,7 @@ void oom_killer_allowed_end(void) * OR try to be smart about which process to kill. Note that we * don't have to be perfect here, we just have to be good. */ -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask, bool force_kill) { const nodemask_t *mpol_mask; @@ -698,6 +688,27 @@ out: schedule_timeout_killable(1); } +/** out_of_memory - tries to invoke OOM killer. + * @zonelist: zonelist pointer + * @gfp_mask: memory allocation flags + * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator + * @force_kill: true if a task must be killed, even if others are exiting + * + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() + * when it returns false. Otherwise returns true. + */ +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, + int order, nodemask_t *nodemask, bool force_kill) +{ + if (!down_read_trylock(&oom_sem)) + return false; + __out_of_memory(zonlist, gfp_mask, order, nodemask, force_kill); + up_read(&oom_sem); + + return true; +} + /* * The pagefault handler calls here because it is out of memory, so kill a * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a @@ -712,7 +723,7 @@ void pagefault_out_of_memory(void) zonelist = node_zonelist(first_memory_node, GFP_KERNEL); if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { - out_of_memory(NULL, 0, 0, NULL, false); + __out_of_memory(NULL, 0, 0, NULL, false); oom_zonelist_unlock(zonelist, GFP_KERNEL); } } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 206ce46ce975..fdbcdd9cd1a9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2239,10 +2239,11 @@ static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int classzone_idx, int migratetype) + int classzone_idx, int migratetype, bool *oom_failed) { struct page *page; + *oom_failed = false; /* Acquire the per-zone oom lock for each zone */ if (!oom_zonelist_trylock(zonelist, gfp_mask)) { schedule_timeout_uninterruptible(1); @@ -2279,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; } /* Exhausted what can be done so it's blamo time */ - out_of_memory(zonelist, gfp_mask, order, nodemask, false); - + if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false)) + *oom_failed = true; out: oom_zonelist_unlock(zonelist, gfp_mask); return page; @@ -2706,26 +2707,28 @@ rebalance: */ if (!did_some_progress) { if (oom_gfp_allowed(gfp_mask)) { + bool oom_failed; + /* Coredumps can quickly deplete all memory reserves */ if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL)) goto nopage; - /* - * Just make sure that we cannot race with oom_killer - * disabling e.g. PM freezer needs to make sure that - * no OOM happens after all tasks are frozen. - */ - if (!oom_killer_allowed_start()) - goto nopage; page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, - classzone_idx, migratetype); - oom_killer_allowed_end(); + classzone_idx, migratetype, + &oom_failed); if (page) goto got_pg; + /* + * OOM killer might be disabled and then we have to + * fail the allocation + */ + if (oom_failed) + goto no_page; + if (!(gfp_mask & __GFP_NOFAIL)) { /* * The oom killer is not called for high-order -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 17:46 ` Michal Hocko @ 2014-11-05 17:55 ` Tejun Heo 2014-11-06 12:49 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-05 17:55 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote: > Because out_of_memory can be called from mutliple paths. And > the only interesting one should be the page allocation path. > pagefault_out_of_memory is not interesting because it cannot happen for > the frozen task. Hmmm.... wouldn't that be broken by definition tho? So, if the oom killer is invoked from somewhere else than page allocation path, it would proceed ignoring the disabled setting and would race against PM freeze path all the same. Why are things broken at such basic levels? Something named oom_killer_disable does a lame attempt at it and not even that depending on who's calling. There probably is a history leading to the current situation but the level that things are broken at is too basic and baffling. :( -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 17:55 ` Tejun Heo @ 2014-11-06 12:49 ` Michal Hocko 2014-11-06 15:01 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-06 12:49 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 12:55:27, Tejun Heo wrote: > On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote: > > Because out_of_memory can be called from mutliple paths. And > > the only interesting one should be the page allocation path. > > pagefault_out_of_memory is not interesting because it cannot happen for > > the frozen task. > > Hmmm.... wouldn't that be broken by definition tho? So, if the oom > killer is invoked from somewhere else than page allocation path, it > would proceed ignoring the disabled setting and would race against PM > freeze path all the same. Not really because try_to_freeze_tasks doesn't finish until _all_ tasks are frozen and a task in the page fault path cannot be frozen, can it? I mean there shouldn't be any problem to not invoke OOM killer under from the page fault path as well but that might lead to looping in the page fault path without any progress until freezer enables OOM killer on the failure path because the said task cannot be frozen. Is this preferable? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 12:49 ` Michal Hocko @ 2014-11-06 15:01 ` Tejun Heo 2014-11-06 16:02 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-06 15:01 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu, Nov 06, 2014 at 01:49:53PM +0100, Michal Hocko wrote: > On Wed 05-11-14 12:55:27, Tejun Heo wrote: > > On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote: > > > Because out_of_memory can be called from mutliple paths. And > > > the only interesting one should be the page allocation path. > > > pagefault_out_of_memory is not interesting because it cannot happen for > > > the frozen task. > > > > Hmmm.... wouldn't that be broken by definition tho? So, if the oom > > killer is invoked from somewhere else than page allocation path, it > > would proceed ignoring the disabled setting and would race against PM > > freeze path all the same. > > Not really because try_to_freeze_tasks doesn't finish until _all_ tasks > are frozen and a task in the page fault path cannot be frozen, can it? We used to have freezing points deep in file system code which may be reacheable from page fault. Please take a step back and look at the paragraph above. Doesn't it sound extremely contrived and brittle even if it's not outright broken? What if somebody adds another oom killing site somewhere else? How can this possibly be a solution that we intentionally implement? > I mean there shouldn't be any problem to not invoke OOM killer under > from the page fault path as well but that might lead to looping in the > page fault path without any progress until freezer enables OOM killer on > the failure path because the said task cannot be frozen. > > Is this preferable? Why would PM freezing make OOM killing fail? That doesn't make much sense. Sure, it can block it for a finite duration for sync purposes but making OOM killing fail seems the wrong way around. We're doing one thing for non-PM freezing and the other way around for PM freezing, which indicates one of the two directions is wrong. Shouldn't it be that OOM killing happening while PM freezing is in progress cancels PM freezing rather than the other way around? Find a point in PM suspend/hibernation operation where everything must be stable, disable OOM killing there and check whether OOM killing happened inbetween and if so back out. It seems rather obvious to me that OOM killing has to have precedence over PM freezing. Sure, once the system reaches a point where the whole system must be in a stable state for snapshotting or whatever, disabling OOM killing is fine but at that point the system is in a very limited execution mode and sure won't be processing page faults from userland for example and we can actually disable OOM killing knowing that anything afterwards is ready to handle memory allocation failures. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 15:01 ` Tejun Heo @ 2014-11-06 16:02 ` Michal Hocko 2014-11-06 16:28 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-06 16:02 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu 06-11-14 10:01:21, Tejun Heo wrote: > On Thu, Nov 06, 2014 at 01:49:53PM +0100, Michal Hocko wrote: > > On Wed 05-11-14 12:55:27, Tejun Heo wrote: > > > On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote: > > > > Because out_of_memory can be called from mutliple paths. And > > > > the only interesting one should be the page allocation path. > > > > pagefault_out_of_memory is not interesting because it cannot happen for > > > > the frozen task. > > > > > > Hmmm.... wouldn't that be broken by definition tho? So, if the oom > > > killer is invoked from somewhere else than page allocation path, it > > > would proceed ignoring the disabled setting and would race against PM > > > freeze path all the same. > > > > Not really because try_to_freeze_tasks doesn't finish until _all_ tasks > > are frozen and a task in the page fault path cannot be frozen, can it? > > We used to have freezing points deep in file system code which may be > reacheable from page fault. If that is really the case then there is no way around and use out_of_memory from the page fault path as well. I cannot say I would be happy about that though. There should be ideally only single freezing place. But that is another story. > Please take a step back and look at the paragraph above. Doesn't > it sound extremely contrived and brittle even if it's not outright > broken? What if somebody adds another oom killing site somewhere > else? The only way to add an oom killing site is out_of_memory and that does all the magic with my patch. > How can this possibly be a solution that we intentionally implement? > > > I mean there shouldn't be any problem to not invoke OOM killer under > > from the page fault path as well but that might lead to looping in the > > page fault path without any progress until freezer enables OOM killer on > > the failure path because the said task cannot be frozen. > > > > Is this preferable? > > Why would PM freezing make OOM killing fail? That doesn't make much > sense. Sure, it can block it for a finite duration for sync purposes > but making OOM killing fail seems the wrong way around. We cannot block in the allocation path because the request might come from the freezer path itself (e.g. when suspending devices etc.). At least this is my understanding why the original oom disable approach was implemented. > We're doing one thing for non-PM freezing and the other way around for > PM freezing, which indicates one of the two directions is wrong. Because those two paths are quite different in their requirements. The cgroup freezer only cares about freezing tasks and it doesn't have to care about tasks accessing a possibly half suspended device on their way out. > Shouldn't it be that OOM killing happening while PM freezing is in > progress cancels PM freezing rather than the other way around? Find a > point in PM suspend/hibernation operation where everything must be > stable, disable OOM killing there and check whether OOM killing > happened inbetween and if so back out. This is freeze_processes AFAIU. I might be wrong of course but this is the time since when nobody should be waking processes up because they could access half suspended devices. > It seems rather obvious to me that OOM killing has to have precedence > over PM freezing. > > Sure, once the system reaches a point where the whole system must be > in a stable state for snapshotting or whatever, disabling OOM killing > is fine but at that point the system is in a very limited execution > mode and sure won't be processing page faults from userland for > example and we can actually disable OOM killing knowing that anything > afterwards is ready to handle memory allocation failures. I am really confused now. This is basically what the final patch does actually. Here is the what I have currently just to make the further discussion easier. --- >From 337e772eaf636a96409e84bcd33d77ebc2950549 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 5 Nov 2014 15:09:56 +0100 Subject: [PATCH 1/2] OOM, PM: make OOM detection in the freezer path raceless 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer exclusion, though. This is done by this patch which introduces oom_sem RW lock and gets rid of oom_killer_disabled global flag. The PM code uses oom_killer_{disable,enable} which takes the lock for write and excludes all the OOM killer invocation from any out_of_memory users which newly returns a success status. It fails only if the oom_sem cannot be taken for read which indicates that OOM has been disabled. This is done by read trylock so we can never deadlock. The caller has to take an appropriate action when the out_of_memory fails. The allocation path simply fails the allocation request the same way as previously. Sysrq path notes that the OOM didn't happen due to OOM disable. The page fault path ignored oom disabled flag previously with an assumption that the page fault path cannot enter the fridge. As per Tejun the freezing point used to be deep in the fs code. Therefore it is safer and more robust to include pagefault_out_of_memory as well. The task will be refaulting until there is some memory freed or PM freezer fails because the said task cannot be frozen and re-enable OOM killer when the OOM eventually happens if the memory still short. There is no need to recheck all the processes with the full synchronization anymore. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> fold me --- drivers/tty/sysrq.c | 6 ++++-- include/linux/oom.h | 25 +++++++++++++---------- kernel/power/process.c | 50 ++++++++-------------------------------------- mm/oom_kill.c | 54 ++++++++++++++++++++++++++++++++------------------ mm/page_alloc.c | 32 +++++++++++++++--------------- 5 files changed, 77 insertions(+), 90 deletions(-) diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 42bad18c66c9..14f3d7fd961f 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { static void moom_callback(struct work_struct *ignored) { - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, - 0, NULL, true); + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), + GFP_KERNEL, 0, NULL, true)) { + printk(KERN_INFO "OOM killer disabled\n"); + } } static DECLARE_WORK(moom_work, moom_callback); diff --git a/include/linux/oom.h b/include/linux/oom.h index e8d6e1058723..04b892ddca7d 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -68,22 +68,25 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); -extern bool oom_killer_disabled; - -static inline void oom_killer_disable(void) -{ - oom_killer_disabled = true; -} +/** + * oom_killer_disable - disable OOM killer in page allocator + * + * Forces all page allocations to fail rather than trigger OOM killer. + * + * This function should be used with an extreme care and any new usage + * should be consulted with MM people. + */ +extern void oom_killer_disable(void); -static inline void oom_killer_enable(void) -{ - oom_killer_disabled = false; -} +/** + * oom_killer_enable - enable OOM killer + */ +extern void oom_killer_enable(void); static inline bool oom_gfp_allowed(gfp_t gfp_mask) { diff --git a/kernel/power/process.c b/kernel/power/process.c index 5a6ec8678b9a..7d08d56cbf3f 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } -static bool __check_frozen_processes(void) -{ - struct task_struct *g, *p; - - for_each_process_thread(g, p) - if (p != current && !freezer_should_skip(p) && !frozen(p)) - return false; - - return true; -} - -/* - * Returns true if all freezable tasks (except for current) are frozen already - */ -static bool check_frozen_processes(void) -{ - bool ret; - - read_lock(&tasklist_lock); - ret = __check_frozen_processes(); - read_unlock(&tasklist_lock); - return ret; -} - /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) int freeze_processes(void) { int error; - int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -157,27 +132,18 @@ int freeze_processes(void) pm_wakeup_clear(); printk("Freezing user space processes ... "); pm_freezing = true; - oom_kills_saved = oom_kills_count(); + + /* + * Need to exlude OOM killer from triggering while tasks are + * getting frozen to make sure none of them gets killed after + * try_to_freeze_tasks is done. + */ + oom_killer_disable(); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); - oom_killer_disable(); - - /* - * There might have been an OOM kill while we were - * freezing tasks and the killed task might be still - * on the way out so we have to double check for race. - */ - if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); - error = -EBUSY; - } else { - printk("done."); - } + printk("done.\n"); } - printk("\n"); BUG_ON(in_atomic()); if (error) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5340f6b91312..7f88ddd55f80 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, dump_tasks(memcg, nodemask); } -/* - * Number of OOM killer invocations (including memcg OOM killer). - * Primarily used by PM freezer to check for potential races with - * OOM killed frozen task. - */ -static atomic_t oom_kills = ATOMIC_INIT(0); - -int oom_kills_count(void) -{ - return atomic_read(&oom_kills); -} - -void note_oom_kill(void) -{ - atomic_inc(&oom_kills); -} - #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon @@ -615,8 +598,20 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) spin_unlock(&zone_scan_lock); } +static DECLARE_RWSEM(oom_sem); + +void oom_killer_disable(void) +{ + down_write(&oom_sem); +} + +void oom_killer_enable(void) +{ + up_write(&oom_sem); +} + /** - * out_of_memory - kill the "best" process when we run out of memory + * __out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 @@ -628,7 +623,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) * OR try to be smart about which process to kill. Note that we * don't have to be perfect here, we just have to be good. */ -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask, bool force_kill) { const nodemask_t *mpol_mask; @@ -693,6 +688,27 @@ out: schedule_timeout_killable(1); } +/** out_of_memory - tries to invoke OOM killer. + * @zonelist: zonelist pointer + * @gfp_mask: memory allocation flags + * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator + * @force_kill: true if a task must be killed, even if others are exiting + * + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() + * when it returns false. Otherwise returns true. + */ +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, + int order, nodemask_t *nodemask, bool force_kill) +{ + if (!down_read_trylock(&oom_sem)) + return false; + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); + up_read(&oom_sem); + + return true; +} + /* * The pagefault handler calls here because it is out of memory, so kill a * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9cd36b822444..d44d69aa7b70 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype) PB_migrate, PB_migrate_end); } -bool oom_killer_disabled __read_mostly; - #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { @@ -2241,10 +2239,11 @@ static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int classzone_idx, int migratetype) + int classzone_idx, int migratetype, bool *oom_failed) { struct page *page; + *oom_failed = false; /* Acquire the per-zone oom lock for each zone */ if (!oom_zonelist_trylock(zonelist, gfp_mask)) { schedule_timeout_uninterruptible(1); @@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* - * PM-freezer should be notified that there might be an OOM killer on - * its way to kill and wake somebody up. This is too early and we might - * end up not killing anything but false positives are acceptable. - * See freeze_processes. - */ - note_oom_kill(); - - /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. @@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; } /* Exhausted what can be done so it's blamo time */ - out_of_memory(zonelist, gfp_mask, order, nodemask, false); - + if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false)) + *oom_failed = true; out: oom_zonelist_unlock(zonelist, gfp_mask); return page; @@ -2716,8 +2707,8 @@ rebalance: */ if (!did_some_progress) { if (oom_gfp_allowed(gfp_mask)) { - if (oom_killer_disabled) - goto nopage; + bool oom_failed; + /* Coredumps can quickly deplete all memory reserves */ if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL)) @@ -2725,10 +2716,19 @@ rebalance: page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, - classzone_idx, migratetype); + classzone_idx, migratetype, + &oom_failed); + if (page) goto got_pg; + /* + * OOM killer might be disabled and then we have to + * fail the allocation + */ + if (oom_failed) + goto nopage; + if (!(gfp_mask & __GFP_NOFAIL)) { /* * The oom killer is not called for high-order -- 2.1.1 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 16:02 ` Michal Hocko @ 2014-11-06 16:28 ` Tejun Heo 2014-11-10 16:30 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-06 16:28 UTC (permalink / raw) To: Michal Hocko Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu, Nov 06, 2014 at 05:02:23PM +0100, Michal Hocko wrote: > > Why would PM freezing make OOM killing fail? That doesn't make much > > sense. Sure, it can block it for a finite duration for sync purposes > > but making OOM killing fail seems the wrong way around. > > We cannot block in the allocation path because the request might come > from the freezer path itself (e.g. when suspending devices etc.). > At least this is my understanding why the original oom disable approach > was implemented. I was saying that it could temporarily block either direction to implement proper synchronization while guaranteeing forward progress. > > We're doing one thing for non-PM freezing and the other way around for > > PM freezing, which indicates one of the two directions is wrong. > > Because those two paths are quite different in their requirements. The > cgroup freezer only cares about freezing tasks and it doesn't have to > care about tasks accessing a possibly half suspended device on their way > out. I don't think the fundamental relationship between freezing and oom killing are different between the two and the failure to recognize that is what's leading to these weird issues. > > Shouldn't it be that OOM killing happening while PM freezing is in > > progress cancels PM freezing rather than the other way around? Find a > > point in PM suspend/hibernation operation where everything must be > > stable, disable OOM killing there and check whether OOM killing > > happened inbetween and if so back out. > > This is freeze_processes AFAIU. I might be wrong of course but this is > the time since when nobody should be waking processes up because they > could access half suspended devices. No, you're doing it before freezing starts. The system is in no way in a quiescent state at that point. > > It seems rather obvious to me that OOM killing has to have precedence > > over PM freezing. > > > > Sure, once the system reaches a point where the whole system must be > > in a stable state for snapshotting or whatever, disabling OOM killing > > is fine but at that point the system is in a very limited execution > > mode and sure won't be processing page faults from userland for > > example and we can actually disable OOM killing knowing that anything > > afterwards is ready to handle memory allocation failures. > > I am really confused now. This is basically what the final patch does > actually. Here is the what I have currently just to make the further > discussion easier. Please see above. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-06 16:28 ` Tejun Heo @ 2014-11-10 16:30 ` Michal Hocko 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 0 siblings, 2 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-10 16:30 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Thu 06-11-14 11:28:45, Tejun Heo wrote: > On Thu, Nov 06, 2014 at 05:02:23PM +0100, Michal Hocko wrote: [...] > > > We're doing one thing for non-PM freezing and the other way around for > > > PM freezing, which indicates one of the two directions is wrong. > > > > Because those two paths are quite different in their requirements. The > > cgroup freezer only cares about freezing tasks and it doesn't have to > > care about tasks accessing a possibly half suspended device on their way > > out. > > I don't think the fundamental relationship between freezing and oom > killing are different between the two and the failure to recognize > that is what's leading to these weird issues. I do not understand the above. Could you be more specific, please? AFAIU cgroup freezer requires that no task will leak into userspace while the cgroup is frozen. This is naturally true for the OOM path whether the two are synchronized or not. The PM freezer, on the other hand, requires that no task is _woken up_ after all tasks are frozen. This requires synchronization between the freezer and OOM path because allocations are allowed also after tasks are frozen. What am I missing? > > > Shouldn't it be that OOM killing happening while PM freezing is in > > > progress cancels PM freezing rather than the other way around? Find a > > > point in PM suspend/hibernation operation where everything must be > > > stable, disable OOM killing there and check whether OOM killing > > > happened inbetween and if so back out. > > > > This is freeze_processes AFAIU. I might be wrong of course but this is > > the time since when nobody should be waking processes up because they > > could access half suspended devices. > > No, you're doing it before freezing starts. The system is in no way > in a quiescent state at that point. You are right! Userspace shouldn't see any unexpected allocation failures just because PM freezing is in progress. This whole process should be transparent from userspace POV. I am getting back to oom_killer_lock(); error = try_to_freeze_tasks(); if (!error) oom_killer_disable(); oom_killer_unlock(); Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC 0/4] OOM vs PM freezer fixes 2014-11-10 16:30 ` Michal Hocko @ 2014-11-12 18:58 ` Michal Hocko 2014-11-12 18:58 ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko ` (4 more replies) 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 1 sibling, 5 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw) To: LKML Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang Hi, here is another take at OOM vs. PM freezer interaction fixes/cleanups. First three patches are fixes for an unlikely cases when OOM races with the PM freezer which should be closed completely finally. The last patch is a simple code enhancement which is not needed strictly speaking but it is nice to have IMO. Both OOM killer and PM freezer are quite subtle so I hope I haven't missing anything. Any feedback is highly appreciated. I am also interested about feedback for the used approach. To be honest I am not really happy about spreading TIF_MEMDIE checks into freezer (patch 1) but I didn't find any other way for detecting OOM killed tasks. Changes are based on top of Linus tree (3.18-rc3). Michal Hocko (4): OOM, PM: Do not miss OOM killed frozen tasks OOM, PM: make OOM detection in the freezer path raceless OOM, PM: handle pm freezer as an OOM victim correctly OOM: thaw the OOM victim if it is frozen Diffstat says: drivers/tty/sysrq.c | 6 ++-- include/linux/oom.h | 39 ++++++++++++++++------ kernel/freezer.c | 15 +++++++-- kernel/power/process.c | 60 +++++++++------------------------- mm/memcontrol.c | 4 ++- mm/oom_kill.c | 89 ++++++++++++++++++++++++++++++++++++++------------ mm/page_alloc.c | 32 +++++++++--------- 7 files changed, 147 insertions(+), 98 deletions(-) ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko @ 2014-11-12 18:58 ` Michal Hocko 2014-11-14 17:55 ` Tejun Heo 2014-11-12 18:58 ` [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko ` (3 subsequent siblings) 4 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw) To: LKML Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang Although the freezer code ignores tasks which are killed by the OOM killer (in freezing_slow_path) there are two problems why this is not suitable for the PM freezer: - The information gets lost on its way from freezing path because it is interpreted as if the task doesn't _need_ to be frozen which is true also for other reasons - The killed task might be frozen (in cgroup) already but hasn't woken up yet. We do not have an easy way to wait for such a task This means that try_to_freeze_tasks will consider all tasks frozen despite there is an OOM victim waiting for its slice to wake up. The OOM might have happened anytime before OOM exlusion started so it might leak without PM freezer noticing and access already suspended devices. Fix this by checking TIF_MEMDIE for each task in freeze_task and consider such a task as blocking the freezer. Also change the return value semantic as the current one is little bit awkward. There is just one caller (try_to_freeze_tasks) which checks the return value and it is only interested whether the request was successful or the task blocks the freezing progress. It is natural to reflect the success by true rather than false. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- kernel/freezer.c | 15 ++++++++++++--- kernel/power/process.c | 5 ++--- 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/kernel/freezer.c b/kernel/freezer.c index a8900a3bc27a..93bd3fc65371 100644 --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -113,7 +113,8 @@ static void fake_signal_wake_up(struct task_struct *p) * thread). * * RETURNS: - * %false, if @p is not freezing or already frozen; %true, otherwise + * %false, if @p cannot get frozen; %true, if successful, already frozen or + * ignored by the freezer altogether. */ bool freeze_task(struct task_struct *p) { @@ -129,12 +130,20 @@ bool freeze_task(struct task_struct *p) * normally. */ if (freezer_should_skip(p)) + return true; + + /* + * Do not check freezing state or attempt to freeze a task + * which has been killed by OOM killer. We are just waiting + * for the task to wake up and die. + */ + if (!test_tsk_thread_flag(p, TIF_MEMDIE)) return false; spin_lock_irqsave(&freezer_lock, flags); if (!freezing(p) || frozen(p)) { spin_unlock_irqrestore(&freezer_lock, flags); - return false; + return true; } if (!(p->flags & PF_KTHREAD)) @@ -143,7 +152,7 @@ bool freeze_task(struct task_struct *p) wake_up_state(p, TASK_INTERRUPTIBLE); spin_unlock_irqrestore(&freezer_lock, flags); - return true; + return false; } void __thaw_task(struct task_struct *p) diff --git a/kernel/power/process.c b/kernel/power/process.c index 5a6ec8678b9a..3d528f291da8 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -47,11 +47,10 @@ static int try_to_freeze_tasks(bool user_only) todo = 0; read_lock(&tasklist_lock); for_each_process_thread(g, p) { - if (p == current || !freeze_task(p)) + if (p != current && freeze_task(p)) continue; - if (!freezer_should_skip(p)) - todo++; + todo++; } read_unlock(&tasklist_lock); -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks 2014-11-12 18:58 ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko @ 2014-11-14 17:55 ` Tejun Heo 0 siblings, 0 replies; 93+ messages in thread From: Tejun Heo @ 2014-11-14 17:55 UTC (permalink / raw) To: Michal Hocko Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang Hello, Michal. On Wed, Nov 12, 2014 at 07:58:49PM +0100, Michal Hocko wrote: > Also change the return value semantic as the current one is little bit > awkward. There is just one caller (try_to_freeze_tasks) which checks > the return value and it is only interested whether the request was > successful or the task blocks the freezing progress. It is natural to > reflect the success by true rather than false. I don't know about this. It's also customary to return %true when further action needs to be taken. I don't think either is particularly wrong but the flip seems gratuitous. > bool freeze_task(struct task_struct *p) > { > @@ -129,12 +130,20 @@ bool freeze_task(struct task_struct *p) > * normally. > */ > if (freezer_should_skip(p)) > + return true; > + > + /* > + * Do not check freezing state or attempt to freeze a task > + * which has been killed by OOM killer. We are just waiting > + * for the task to wake up and die. Maybe saying sth like "consider the task freezing as ...." is a clearer way to put it? > + */ > + if (!test_tsk_thread_flag(p, TIF_MEMDIE)) > return false; Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko 2014-11-12 18:58 ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko @ 2014-11-12 18:58 ` Michal Hocko 2014-11-12 18:58 ` [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly Michal Hocko ` (2 subsequent siblings) 4 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw) To: LKML Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock. oom_killer_disabled is now handled at out_of_memory level which takes the lock for reading. This also means that the page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will be retrying the fault until the freezer fails and Sysrq will simply complain to the log. The freezer will use the new oom_killer_{un}lock API which takes the lock for write to wait for an ongoing OOM killer and block all future invocations while attempting to freeze all the tasks. If it was successful oom_killer_disable is called to disallow all the further OOM killer invocations. There is no need to recheck all the processes with the full synchronization anymore so it can go away again. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- drivers/tty/sysrq.c | 6 ++-- include/linux/oom.h | 36 ++++++++++++++++------- kernel/power/process.c | 52 +++++++--------------------------- mm/memcontrol.c | 4 ++- mm/oom_kill.c | 77 ++++++++++++++++++++++++++++++++++++-------------- mm/page_alloc.c | 32 ++++++++++----------- 6 files changed, 115 insertions(+), 92 deletions(-) diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 42bad18c66c9..6818589c1004 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { static void moom_callback(struct work_struct *ignored) { - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, - 0, NULL, true); + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), + GFP_KERNEL, 0, NULL, true)) { + printk(KERN_INFO "OOM request ignored because killer is disabled\n"); + } } static DECLARE_WORK(moom_work, moom_callback); diff --git a/include/linux/oom.h b/include/linux/oom.h index e8d6e1058723..8ca73c0b07df 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -68,22 +68,38 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); -extern bool oom_killer_disabled; +/** + * oom_killer_disable - disable OOM killer + * + * Forces all page allocations to fail rather than trigger OOM killer. + * Has to be called with oom_killer_lock held to prevent from races + * with an ongoing OOM killer. + * + * This function should be used with an extreme care and any new usage + * should be consulted with MM people. + */ +extern void oom_killer_disable(void); -static inline void oom_killer_disable(void) -{ - oom_killer_disabled = true; -} +/** + * oom_killer_enable - enable OOM killer + */ +extern void oom_killer_enable(void); -static inline void oom_killer_enable(void) -{ - oom_killer_disabled = false; -} +/** oom_killer_lock - locks global OOM killer. + * + * This function should be used with an extreme care. No allocations + * are allowed with the lock held. + */ +extern void oom_killer_lock(void); + +/** oom_killer_unlock - unlocks global OOM killer. + */ +extern void oom_killer_unlock(void); static inline bool oom_gfp_allowed(gfp_t gfp_mask) { diff --git a/kernel/power/process.c b/kernel/power/process.c index 3d528f291da8..5c5da0fe54dd 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -107,30 +107,6 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } -static bool __check_frozen_processes(void) -{ - struct task_struct *g, *p; - - for_each_process_thread(g, p) - if (p != current && !freezer_should_skip(p) && !frozen(p)) - return false; - - return true; -} - -/* - * Returns true if all freezable tasks (except for current) are frozen already - */ -static bool check_frozen_processes(void) -{ - bool ret; - - read_lock(&tasklist_lock); - ret = __check_frozen_processes(); - read_unlock(&tasklist_lock); - return ret; -} - /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -141,12 +117,18 @@ static bool check_frozen_processes(void) int freeze_processes(void) { int error; - int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) return error; + /* + * Need to exlude OOM killer from triggering while tasks are + * getting frozen to make sure none of them gets killed after + * try_to_freeze_tasks is done. + */ + oom_killer_lock() + /* Make sure this task doesn't get frozen */ current->flags |= PF_SUSPEND_TASK; @@ -156,27 +138,13 @@ int freeze_processes(void) pm_wakeup_clear(); printk("Freezing user space processes ... "); pm_freezing = true; - oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); if (!error) { - __usermodehelper_set_disable_depth(UMH_DISABLED); oom_killer_disable(); - - /* - * There might have been an OOM kill while we were - * freezing tasks and the killed task might be still - * on the way out so we have to double check for race. - */ - if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); - error = -EBUSY; - } else { - printk("done."); - } + __usermodehelper_set_disable_depth(UMH_DISABLED); + printk("done.\n"); } - printk("\n"); + oom_killer_unlock(); BUG_ON(in_atomic()); if (error) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d6ac0e33e150..620aff77da4a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) current->memcg_oom.order = order; } +extern bool oom_killer_disabled; + /** * mem_cgroup_oom_synchronize - complete memcg OOM handling * @handle: actually kill/wait or just clean up the OOM state @@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle) if (!memcg) return false; - if (!handle) + if (!handle || oom_killer_disabled) goto cleanup; owait.memcg = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5340f6b91312..0a061803be09 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, dump_tasks(memcg, nodemask); } -/* - * Number of OOM killer invocations (including memcg OOM killer). - * Primarily used by PM freezer to check for potential races with - * OOM killed frozen task. - */ -static atomic_t oom_kills = ATOMIC_INIT(0); - -int oom_kills_count(void) -{ - return atomic_read(&oom_kills); -} - -void note_oom_kill(void) -{ - atomic_inc(&oom_kills); -} - #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon @@ -615,8 +598,31 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) spin_unlock(&zone_scan_lock); } +bool oom_killer_disabled __read_mostly; +static DECLARE_RWSEM(oom_sem); + +void oom_killer_lock(void) +{ + down_write(&oom_sem); +} + +void oom_killer_unlock(void) +{ + up_write(&oom_sem); +} + +void oom_killer_disable(void) +{ + oom_killer_disabled = true; +} + +void oom_killer_enable(void) +{ + oom_killer_disabled = false; +} + /** - * out_of_memory - kill the "best" process when we run out of memory + * __out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 @@ -628,7 +634,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) * OR try to be smart about which process to kill. Note that we * don't have to be perfect here, we just have to be good. */ -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask, bool force_kill) { const nodemask_t *mpol_mask; @@ -693,6 +699,31 @@ out: schedule_timeout_killable(1); } +/** out_of_memory - tries to invoke OOM killer. + * @zonelist: zonelist pointer + * @gfp_mask: memory allocation flags + * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator + * @force_kill: true if a task must be killed, even if others are exiting + * + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() + * when it returns false. Otherwise returns true. + */ +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, + int order, nodemask_t *nodemask, bool force_kill) +{ + bool ret = false; + + down_read(&oom_sem); + if (!oom_killer_disabled) { + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); + ret = true; + } + up_read(&oom_sem); + + return true; +} + /* * The pagefault handler calls here because it is out of memory, so kill a * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a @@ -702,12 +733,16 @@ void pagefault_out_of_memory(void) { struct zonelist *zonelist; + down_read(&oom_sem); if (mem_cgroup_oom_synchronize(true)) - return; + goto unlock; zonelist = node_zonelist(first_memory_node, GFP_KERNEL); if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { - out_of_memory(NULL, 0, 0, NULL, false); + if (!oom_killer_disabled) + __out_of_memory(NULL, 0, 0, NULL, false); oom_zonelist_unlock(zonelist, GFP_KERNEL); } +unlock: + up_read(&oom_sem); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9cd36b822444..d44d69aa7b70 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype) PB_migrate, PB_migrate_end); } -bool oom_killer_disabled __read_mostly; - #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { @@ -2241,10 +2239,11 @@ static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int classzone_idx, int migratetype) + int classzone_idx, int migratetype, bool *oom_failed) { struct page *page; + *oom_failed = false; /* Acquire the per-zone oom lock for each zone */ if (!oom_zonelist_trylock(zonelist, gfp_mask)) { schedule_timeout_uninterruptible(1); @@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* - * PM-freezer should be notified that there might be an OOM killer on - * its way to kill and wake somebody up. This is too early and we might - * end up not killing anything but false positives are acceptable. - * See freeze_processes. - */ - note_oom_kill(); - - /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. @@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; } /* Exhausted what can be done so it's blamo time */ - out_of_memory(zonelist, gfp_mask, order, nodemask, false); - + if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false)) + *oom_failed = true; out: oom_zonelist_unlock(zonelist, gfp_mask); return page; @@ -2716,8 +2707,8 @@ rebalance: */ if (!did_some_progress) { if (oom_gfp_allowed(gfp_mask)) { - if (oom_killer_disabled) - goto nopage; + bool oom_failed; + /* Coredumps can quickly deplete all memory reserves */ if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL)) @@ -2725,10 +2716,19 @@ rebalance: page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, - classzone_idx, migratetype); + classzone_idx, migratetype, + &oom_failed); + if (page) goto got_pg; + /* + * OOM killer might be disabled and then we have to + * fail the allocation + */ + if (oom_failed) + goto nopage; + if (!(gfp_mask & __GFP_NOFAIL)) { /* * The oom killer is not called for high-order -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko 2014-11-12 18:58 ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko 2014-11-12 18:58 ` [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko @ 2014-11-12 18:58 ` Michal Hocko 2014-11-12 18:58 ` [RFC 4/4] OOM: thaw the OOM victim if it is frozen Michal Hocko 2014-11-14 20:14 ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo 4 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw) To: LKML Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang PM freezer doesn't check whether it has been killed by OOM killer after it disables OOM killer which means that it continues with the suspend even though it should die as soon as possible. This has been the case ever since PM suspend disables OOM killer and I suppose it has ignored OOM even before. This is not harmful though. The allocation which triggers OOM will retry the allocation after a process is killed and the next attempt will fail because the OOM killer will be disabled at the time so there is no risk of an endless loop because the OOM victim doesn't die. But this is a correctness issue because no task should ignore OOM. As suggested by Tejun, oom_killer_lock will return a success status now. If the current task is pending fatal signals or TIF_MEMDIE is set after oom_sem is taken then the caller should bail out and this is what freeze_processes does with this patch. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 5 ++++- kernel/power/process.c | 5 ++++- mm/oom_kill.c | 12 +++++++++++- 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 8ca73c0b07df..8f4f634cc5b3 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -92,10 +92,13 @@ extern void oom_killer_enable(void); /** oom_killer_lock - locks global OOM killer. * + * Returns true on success and fails if the OOM killer couldn't be + * locked (e.g. because the current task has been killed before). + * * This function should be used with an extreme care. No allocations * are allowed with the lock held. */ -extern void oom_killer_lock(void); +extern bool oom_killer_lock(void); /** oom_killer_unlock - unlocks global OOM killer. */ diff --git a/kernel/power/process.c b/kernel/power/process.c index 5c5da0fe54dd..49d8d84ccd6e 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -127,7 +127,10 @@ int freeze_processes(void) * getting frozen to make sure none of them gets killed after * try_to_freeze_tasks is done. */ - oom_killer_lock() + if (!oom_killer_lock()) { + usermodehelper_enable(); + return -EBUSY; + } /* Make sure this task doesn't get frozen */ current->flags |= PF_SUSPEND_TASK; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 0a061803be09..39a591092ca0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -601,9 +601,19 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) bool oom_killer_disabled __read_mostly; static DECLARE_RWSEM(oom_sem); -void oom_killer_lock(void) +bool oom_killer_lock(void) { + bool ret = true; + down_write(&oom_sem); + + /* We might have been killed while waiting for the oom_sem. */ + if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) { + up_write(&oom_sem); + ret = false; + } + + return ret; } void oom_killer_unlock(void) -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* [RFC 4/4] OOM: thaw the OOM victim if it is frozen 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko ` (2 preceding siblings ...) 2014-11-12 18:58 ` [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly Michal Hocko @ 2014-11-12 18:58 ` Michal Hocko 2014-11-14 20:14 ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo 4 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw) To: LKML Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add the frozen check and thaw the task right before we send SIGKILL to the victim. The check and thawing in oom_scan_process_thread has to stay because the task might got access to memory reserves even without an explicit SIGKILL from oom_kill_process (e.g. it already has fatal signal pending or it is exiting already). Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/oom_kill.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 39a591092ca0..67ea7fb70fa4 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -511,6 +511,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, rcu_read_unlock(); set_tsk_thread_flag(victim, TIF_MEMDIE); + if (frozen(victim)) + __thaw_task(victim); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); put_task_struct(victim); } -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC 0/4] OOM vs PM freezer fixes 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko ` (3 preceding siblings ...) 2014-11-12 18:58 ` [RFC 4/4] OOM: thaw the OOM victim if it is frozen Michal Hocko @ 2014-11-14 20:14 ` Tejun Heo 2014-11-18 21:08 ` Michal Hocko 4 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-11-14 20:14 UTC (permalink / raw) To: Michal Hocko Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang On Wed, Nov 12, 2014 at 07:58:48PM +0100, Michal Hocko wrote: > Hi, > here is another take at OOM vs. PM freezer interaction fixes/cleanups. > First three patches are fixes for an unlikely cases when OOM races with > the PM freezer which should be closed completely finally. The last patch > is a simple code enhancement which is not needed strictly speaking but > it is nice to have IMO. > > Both OOM killer and PM freezer are quite subtle so I hope I haven't > missing anything. Any feedback is highly appreciated. I am also > interested about feedback for the used approach. To be honest I am not > really happy about spreading TIF_MEMDIE checks into freezer (patch 1) > but I didn't find any other way for detecting OOM killed tasks. I really don't get why this is structured this way. Can't you just do the following? 1. Freeze all freezables. Don't worry about PF_MEMDIE. 2. Disable OOM killer. This should be contained in the OOM killer proper. Lock out the OOM killer and disable it. 3. At this point, we know that no one will create more freezable threads and no new process will be OOM kliled. Wait till there's no process w/ PF_MEMDIE set. There's no reason to lock out or disable OOM killer while the system is not in the quiescent state, which is a big can of worms. Bring down the system to the quiescent state, disable the OOM killer and then drain PF_MEMDIEs. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC 0/4] OOM vs PM freezer fixes 2014-11-14 20:14 ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo @ 2014-11-18 21:08 ` Michal Hocko 2014-11-18 21:10 ` [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-18 21:08 UTC (permalink / raw) To: Tejun Heo Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang On Fri 14-11-14 15:14:19, Tejun Heo wrote: > On Wed, Nov 12, 2014 at 07:58:48PM +0100, Michal Hocko wrote: > > Hi, > > here is another take at OOM vs. PM freezer interaction fixes/cleanups. > > First three patches are fixes for an unlikely cases when OOM races with > > the PM freezer which should be closed completely finally. The last patch > > is a simple code enhancement which is not needed strictly speaking but > > it is nice to have IMO. > > > > Both OOM killer and PM freezer are quite subtle so I hope I haven't > > missing anything. Any feedback is highly appreciated. I am also > > interested about feedback for the used approach. To be honest I am not > > really happy about spreading TIF_MEMDIE checks into freezer (patch 1) > > but I didn't find any other way for detecting OOM killed tasks. > > I really don't get why this is structured this way. Can't you just do > the following? Well, I liked how simple this was and localized at the only place which matters. When I was thinking about a solution which you are describing below it was more complicated and more subtle (e.g. waiting for an OOM victim might be tricky if it stumbles over a lock which is held by a frozen thread which uses try_to_freeze_unsafe). Anyway I gave it another try and will post the two patches as a reply to this email. I hope the both interface and implementation is cleaner. > 1. Freeze all freezables. Don't worry about PF_MEMDIE. > > 2. Disable OOM killer. This should be contained in the OOM killer > proper. Lock out the OOM killer and disable it. > > 3. At this point, we know that no one will create more freezable > threads and no new process will be OOM kliled. Wait till there's > no process w/ PF_MEMDIE set. > > There's no reason to lock out or disable OOM killer while the system > is not in the quiescent state, which is a big can of worms. Bring > down the system to the quiescent state, disable the OOM killer and > then drain PF_MEMDIEs. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE 2014-11-18 21:08 ` Michal Hocko @ 2014-11-18 21:10 ` Michal Hocko 2014-11-18 21:10 ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-11-18 21:10 UTC (permalink / raw) To: LKML Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang This patch is just a preparatory and it doesn't introduce any functional change. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 4 ++++ kernel/exit.c | 2 +- mm/memcontrol.c | 2 +- mm/oom_kill.c | 16 +++++++++++++--- 4 files changed, 19 insertions(+), 5 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index e8d6e1058723..8f7e74f8ab3a 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -47,6 +47,10 @@ static inline bool oom_task_origin(const struct task_struct *p) return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN); } +void mark_tsk_oom_victim(struct task_struct *tsk); + +void unmark_tsk_oom_victim(struct task_struct *tsk); + extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); diff --git a/kernel/exit.c b/kernel/exit.c index 5d30019ff953..323882973b4b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -459,7 +459,7 @@ static void exit_mm(struct task_struct *tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); - clear_thread_flag(TIF_MEMDIE); + unmark_tsk_oom_victim(current); } /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d6ac0e33e150..302e0fc6d121 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1735,7 +1735,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * quickly exit and free its memory. */ if (fatal_signal_pending(current) || current->flags & PF_EXITING) { - set_thread_flag(TIF_MEMDIE); + mark_tsk_oom_victim(current); return; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5340f6b91312..8b6e14136f4f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -421,6 +421,16 @@ void note_oom_kill(void) atomic_inc(&oom_kills); } +void mark_tsk_oom_victim(struct task_struct *tsk) +{ + set_tsk_thread_flag(tsk, TIF_MEMDIE); +} + +void unmark_tsk_oom_victim(struct task_struct *tsk) +{ + clear_thread_flag(TIF_MEMDIE); +} + #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon @@ -444,7 +454,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { - set_tsk_thread_flag(p, TIF_MEMDIE); + mark_tsk_oom_victim(p); put_task_struct(p); return; } @@ -527,7 +537,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } rcu_read_unlock(); - set_tsk_thread_flag(victim, TIF_MEMDIE); + mark_tsk_oom_victim(victim); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); put_task_struct(victim); } @@ -650,7 +660,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * quickly exit and free its memory. */ if (fatal_signal_pending(current) || current->flags & PF_EXITING) { - set_thread_flag(TIF_MEMDIE); + mark_tsk_oom_victim(current); return; } -- 2.1.3 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless 2014-11-18 21:10 ` [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE Michal Hocko @ 2014-11-18 21:10 ` Michal Hocko 2014-11-27 0:47 ` Rafael J. Wysocki 2014-12-02 22:08 ` Tejun Heo 0 siblings, 2 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-18 21:10 UTC (permalink / raw) To: LKML Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled is now checked at out_of_memory level which takes the lock for reading. This also means that the page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will be retrying the fault until the freezer fails and Sysrq OOM trigger will simply complain to the log. oom_killer_disable takes oom_sem for writing and after it disables further OOM killer invocations it checks for any OOM victims which are still alive (because they haven't woken up to handle the pending signal). Victims are counted via {un}mark_tsk_oom_victim. The last victim signals the completion via oom_victims_wait on which oom_killer_disable() waits if it sees non zero oom_victims. This is safe against both mark_tsk_oom_victim which cannot be called after oom_killer_disabled is set and unmark_tsk_oom_victim signals the completion only for the last oom_victim when oom is disabled and oom_killer_disable waits for completion only of there was at least one victim at the time it disabled the oom. As oom_killer_disable is a full OOM barrier now we can postpone it to later after all freezable tasks are frozen during PM freezer. This reduces the time when OOM is put out order and so reduces chances of misbehavior due to unexpected allocation failures. TODO: Android lowmemory killer abuses mark_tsk_oom_victim in lowmem_scan and it has to learn about oom_disable logic as well. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- drivers/tty/sysrq.c | 6 ++-- include/linux/oom.h | 26 ++++++++------ kernel/power/process.c | 60 +++++++++----------------------- mm/memcontrol.c | 4 ++- mm/oom_kill.c | 94 +++++++++++++++++++++++++++++++++++++++++--------- mm/page_alloc.c | 32 ++++++++--------- 6 files changed, 132 insertions(+), 90 deletions(-) diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 42bad18c66c9..6818589c1004 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { static void moom_callback(struct work_struct *ignored) { - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, - 0, NULL, true); + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), + GFP_KERNEL, 0, NULL, true)) { + printk(KERN_INFO "OOM request ignored because killer is disabled\n"); + } } static DECLARE_WORK(moom_work, moom_callback); diff --git a/include/linux/oom.h b/include/linux/oom.h index 8f7e74f8ab3a..d802575c9307 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -72,22 +72,26 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); -extern bool oom_killer_disabled; - -static inline void oom_killer_disable(void) -{ - oom_killer_disabled = true; -} +/** + * oom_killer_disable - disable OOM killer + * + * Forces all page allocations to fail rather than trigger OOM killer. + * Will block and wait until all OOM victims are dead. + * + * Returns true if successfull and false if the OOM killer cannot be + * disabled. + */ +extern bool oom_killer_disable(void); -static inline void oom_killer_enable(void) -{ - oom_killer_disabled = false; -} +/** + * oom_killer_enable - enable OOM killer + */ +extern void oom_killer_enable(void); static inline bool oom_gfp_allowed(gfp_t gfp_mask) { diff --git a/kernel/power/process.c b/kernel/power/process.c index 5a6ec8678b9a..a4306e39f35c 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } -static bool __check_frozen_processes(void) -{ - struct task_struct *g, *p; - - for_each_process_thread(g, p) - if (p != current && !freezer_should_skip(p) && !frozen(p)) - return false; - - return true; -} - -/* - * Returns true if all freezable tasks (except for current) are frozen already - */ -static bool check_frozen_processes(void) -{ - bool ret; - - read_lock(&tasklist_lock); - ret = __check_frozen_processes(); - read_unlock(&tasklist_lock); - return ret; -} - /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) int freeze_processes(void) { int error; - int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -157,27 +132,11 @@ int freeze_processes(void) pm_wakeup_clear(); printk("Freezing user space processes ... "); pm_freezing = true; - oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); - oom_killer_disable(); - - /* - * There might have been an OOM kill while we were - * freezing tasks and the killed task might be still - * on the way out so we have to double check for race. - */ - if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); - error = -EBUSY; - } else { - printk("done."); - } + printk("done.\n"); } - printk("\n"); BUG_ON(in_atomic()); if (error) @@ -206,6 +165,18 @@ int freeze_kernel_threads(void) printk("\n"); BUG_ON(in_atomic()); + /* + * Now that everything freezable is handled we need to disbale + * the OOM killer to disallow any further interference with + * killable tasks. + */ + printk("Disabling OOM killer ... "); + if (!oom_killer_disable()) { + printk("failed.\n"); + error = -EAGAIN; + } else + printk("done.\n"); + if (error) thaw_kernel_threads(); return error; @@ -222,8 +193,6 @@ void thaw_processes(void) pm_freezing = false; pm_nosig_freezing = false; - oom_killer_enable(); - printk("Restarting tasks ... "); __usermodehelper_set_disable_depth(UMH_FREEZING); @@ -251,6 +220,9 @@ void thaw_kernel_threads(void) { struct task_struct *g, *p; + printk("Enabling OOM killer again.\n"); + oom_killer_enable(); + pm_nosig_freezing = false; printk("Restarting kernel threads ... "); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 302e0fc6d121..34bcbb053132 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) current->memcg_oom.order = order; } +extern bool oom_killer_disabled; + /** * mem_cgroup_oom_synchronize - complete memcg OOM handling * @handle: actually kill/wait or just clean up the OOM state @@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle) if (!memcg) return false; - if (!handle) + if (!handle || oom_killer_disabled) goto cleanup; owait.memcg = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8b6e14136f4f..b3ccd92bc6dc 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -405,30 +405,63 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, } /* - * Number of OOM killer invocations (including memcg OOM killer). - * Primarily used by PM freezer to check for potential races with - * OOM killed frozen task. + * Number of OOM victims in flight */ -static atomic_t oom_kills = ATOMIC_INIT(0); +static atomic_t oom_victims = ATOMIC_INIT(0); +static DECLARE_COMPLETION(oom_victims_wait); -int oom_kills_count(void) +bool oom_killer_disabled __read_mostly; +static DECLARE_RWSEM(oom_sem); + +void mark_tsk_oom_victim(struct task_struct *tsk) { - return atomic_read(&oom_kills); + BUG_ON(oom_killer_disabled); + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) + return; + atomic_inc(&oom_victims); } -void note_oom_kill(void) +void unmark_tsk_oom_victim(struct task_struct *tsk) { - atomic_inc(&oom_kills); + int count; + + if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) + return; + + down_read(&oom_sem); + /* + * There is no need to signal the lasst oom_victim if there + * is nobody who cares. + */ + if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) + complete(&oom_victims_wait); + up_read(&oom_sem); } -void mark_tsk_oom_victim(struct task_struct *tsk) +bool oom_killer_disable(void) { - set_tsk_thread_flag(tsk, TIF_MEMDIE); + /* + * Make sure to not race with an ongoing OOM killer + * and that the current is not the victim. + */ + down_write(&oom_sem); + if (!test_tsk_thread_flag(current, TIF_MEMDIE)) + oom_killer_disabled = true; + + count = atomic_read(&oom_victims); + up_write(&oom_sem); + + if (count && oom_killer_disabled) + wait_for_completion(&oom_victims_wait); + + return oom_killer_disabled; } -void unmark_tsk_oom_victim(struct task_struct *tsk) +void oom_killer_enable(void) { - clear_thread_flag(TIF_MEMDIE); + down_write(&oom_sem); + oom_killer_disabled = false; + up_write(&oom_sem); } #define K(x) ((x) << (PAGE_SHIFT-10)) @@ -626,7 +659,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) } /** - * out_of_memory - kill the "best" process when we run out of memory + * __out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 @@ -638,7 +671,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) * OR try to be smart about which process to kill. Note that we * don't have to be perfect here, we just have to be good. */ -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask, bool force_kill) { const nodemask_t *mpol_mask; @@ -703,6 +736,31 @@ out: schedule_timeout_killable(1); } +/** out_of_memory - tries to invoke OOM killer. + * @zonelist: zonelist pointer + * @gfp_mask: memory allocation flags + * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator + * @force_kill: true if a task must be killed, even if others are exiting + * + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() + * when it returns false. Otherwise returns true. + */ +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, + int order, nodemask_t *nodemask, bool force_kill) +{ + bool ret = false; + + down_read(&oom_sem); + if (!oom_killer_disabled) { + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); + ret = true; + } + up_read(&oom_sem); + + return ret; +} + /* * The pagefault handler calls here because it is out of memory, so kill a * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void) { struct zonelist *zonelist; + down_read(&oom_sem); if (mem_cgroup_oom_synchronize(true)) - return; + goto unlock; zonelist = node_zonelist(first_memory_node, GFP_KERNEL); if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { - out_of_memory(NULL, 0, 0, NULL, false); + if (!oom_killer_disabled) + __out_of_memory(NULL, 0, 0, NULL, false); oom_zonelist_unlock(zonelist, GFP_KERNEL); } +unlock: + up_read(&oom_sem); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9cd36b822444..d44d69aa7b70 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype) PB_migrate, PB_migrate_end); } -bool oom_killer_disabled __read_mostly; - #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { @@ -2241,10 +2239,11 @@ static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist, enum zone_type high_zoneidx, nodemask_t *nodemask, struct zone *preferred_zone, - int classzone_idx, int migratetype) + int classzone_idx, int migratetype, bool *oom_failed) { struct page *page; + *oom_failed = false; /* Acquire the per-zone oom lock for each zone */ if (!oom_zonelist_trylock(zonelist, gfp_mask)) { schedule_timeout_uninterruptible(1); @@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* - * PM-freezer should be notified that there might be an OOM killer on - * its way to kill and wake somebody up. This is too early and we might - * end up not killing anything but false positives are acceptable. - * See freeze_processes. - */ - note_oom_kill(); - - /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. @@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; } /* Exhausted what can be done so it's blamo time */ - out_of_memory(zonelist, gfp_mask, order, nodemask, false); - + if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false)) + *oom_failed = true; out: oom_zonelist_unlock(zonelist, gfp_mask); return page; @@ -2716,8 +2707,8 @@ rebalance: */ if (!did_some_progress) { if (oom_gfp_allowed(gfp_mask)) { - if (oom_killer_disabled) - goto nopage; + bool oom_failed; + /* Coredumps can quickly deplete all memory reserves */ if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL)) @@ -2725,10 +2716,19 @@ rebalance: page = __alloc_pages_may_oom(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, - classzone_idx, migratetype); + classzone_idx, migratetype, + &oom_failed); + if (page) goto got_pg; + /* + * OOM killer might be disabled and then we have to + * fail the allocation + */ + if (oom_failed) + goto nopage; + if (!(gfp_mask & __GFP_NOFAIL)) { /* * The oom killer is not called for high-order -- 2.1.3 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless 2014-11-18 21:10 ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko @ 2014-11-27 0:47 ` Rafael J. Wysocki 2014-12-02 22:08 ` Tejun Heo 1 sibling, 0 replies; 93+ messages in thread From: Rafael J. Wysocki @ 2014-11-27 0:47 UTC (permalink / raw) To: Michal Hocko, Tejun Heo Cc: LKML, linux-mm, linux-pm, Andrew Morton, David Rientjes, Oleg Nesterov, Cong Wang On Tuesday, November 18, 2014 10:10:06 PM Michal Hocko wrote: > 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) > has left a race window when OOM killer manages to note_oom_kill after > freeze_processes checks the counter. The race window is quite small and > really unlikely and partial solution deemed sufficient at the time of > submission. > > Tejun wasn't happy about this partial solution though and insisted on a > full solution. That requires the full OOM and freezer's task freezing > exclusion, though. This is done by this patch which introduces oom_sem > RW lock and turns oom_killer_disable() into a full OOM barrier. > > oom_killer_disabled is now checked at out_of_memory level which takes > the lock for reading. This also means that the page fault path is > covered now as well although it was assumed to be safe before. As per > Tejun, "We used to have freezing points deep in file system code which > may be reacheable from page fault." so it would be better and more > robust to not rely on freezing points here. Same applies to the memcg > OOM killer. > > out_of_memory tells the caller whether the OOM was allowed to > trigger and the callers are supposed to handle the situation. The page > allocation path simply fails the allocation same as before. The page > fault path will be retrying the fault until the freezer fails and Sysrq > OOM trigger will simply complain to the log. > > oom_killer_disable takes oom_sem for writing and after it disables > further OOM killer invocations it checks for any OOM victims which > are still alive (because they haven't woken up to handle the pending > signal). Victims are counted via {un}mark_tsk_oom_victim. The > last victim signals the completion via oom_victims_wait on which > oom_killer_disable() waits if it sees non zero oom_victims. > This is safe against both mark_tsk_oom_victim which cannot be called > after oom_killer_disabled is set and unmark_tsk_oom_victim signals the > completion only for the last oom_victim when oom is disabled and > oom_killer_disable waits for completion only of there was at least one > victim at the time it disabled the oom. > > As oom_killer_disable is a full OOM barrier now we can postpone it to > later after all freezable tasks are frozen during PM freezer. This > reduces the time when OOM is put out order and so reduces chances of > misbehavior due to unexpected allocation failures. > > TODO: > Android lowmemory killer abuses mark_tsk_oom_victim in lowmem_scan > and it has to learn about oom_disable logic as well. > > Suggested-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Michal Hocko <mhocko@suse.cz> This appears to do the right thing to me, although I admit I haven't checked the details very carefully. Tejun? > --- > drivers/tty/sysrq.c | 6 ++-- > include/linux/oom.h | 26 ++++++++------ > kernel/power/process.c | 60 +++++++++----------------------- > mm/memcontrol.c | 4 ++- > mm/oom_kill.c | 94 +++++++++++++++++++++++++++++++++++++++++--------- > mm/page_alloc.c | 32 ++++++++--------- > 6 files changed, 132 insertions(+), 90 deletions(-) > > diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c > index 42bad18c66c9..6818589c1004 100644 > --- a/drivers/tty/sysrq.c > +++ b/drivers/tty/sysrq.c > @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { > > static void moom_callback(struct work_struct *ignored) > { > - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, > - 0, NULL, true); > + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), > + GFP_KERNEL, 0, NULL, true)) { > + printk(KERN_INFO "OOM request ignored because killer is disabled\n"); > + } > } > > static DECLARE_WORK(moom_work, moom_callback); > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 8f7e74f8ab3a..d802575c9307 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -72,22 +72,26 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > extern int unregister_oom_notifier(struct notifier_block *nb); > > -extern bool oom_killer_disabled; > - > -static inline void oom_killer_disable(void) > -{ > - oom_killer_disabled = true; > -} > +/** > + * oom_killer_disable - disable OOM killer > + * > + * Forces all page allocations to fail rather than trigger OOM killer. > + * Will block and wait until all OOM victims are dead. > + * > + * Returns true if successfull and false if the OOM killer cannot be > + * disabled. > + */ > +extern bool oom_killer_disable(void); > > -static inline void oom_killer_enable(void) > -{ > - oom_killer_disabled = false; > -} > +/** > + * oom_killer_enable - enable OOM killer > + */ > +extern void oom_killer_enable(void); > > static inline bool oom_gfp_allowed(gfp_t gfp_mask) > { > diff --git a/kernel/power/process.c b/kernel/power/process.c > index 5a6ec8678b9a..a4306e39f35c 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) > return todo ? -EBUSY : 0; > } > > -static bool __check_frozen_processes(void) > -{ > - struct task_struct *g, *p; > - > - for_each_process_thread(g, p) > - if (p != current && !freezer_should_skip(p) && !frozen(p)) > - return false; > - > - return true; > -} > - > -/* > - * Returns true if all freezable tasks (except for current) are frozen already > - */ > -static bool check_frozen_processes(void) > -{ > - bool ret; > - > - read_lock(&tasklist_lock); > - ret = __check_frozen_processes(); > - read_unlock(&tasklist_lock); > - return ret; > -} > - > /** > * freeze_processes - Signal user space processes to enter the refrigerator. > * The current thread will not be frozen. The same process that calls > @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) > int freeze_processes(void) > { > int error; > - int oom_kills_saved; > > error = __usermodehelper_disable(UMH_FREEZING); > if (error) > @@ -157,27 +132,11 @@ int freeze_processes(void) > pm_wakeup_clear(); > printk("Freezing user space processes ... "); > pm_freezing = true; > - oom_kills_saved = oom_kills_count(); > error = try_to_freeze_tasks(true); > if (!error) { > __usermodehelper_set_disable_depth(UMH_DISABLED); > - oom_killer_disable(); > - > - /* > - * There might have been an OOM kill while we were > - * freezing tasks and the killed task might be still > - * on the way out so we have to double check for race. > - */ > - if (oom_kills_count() != oom_kills_saved && > - !check_frozen_processes()) { > - __usermodehelper_set_disable_depth(UMH_ENABLED); > - printk("OOM in progress."); > - error = -EBUSY; > - } else { > - printk("done."); > - } > + printk("done.\n"); > } > - printk("\n"); > BUG_ON(in_atomic()); > > if (error) > @@ -206,6 +165,18 @@ int freeze_kernel_threads(void) > printk("\n"); > BUG_ON(in_atomic()); > > + /* > + * Now that everything freezable is handled we need to disbale > + * the OOM killer to disallow any further interference with > + * killable tasks. > + */ > + printk("Disabling OOM killer ... "); > + if (!oom_killer_disable()) { > + printk("failed.\n"); > + error = -EAGAIN; > + } else > + printk("done.\n"); > + > if (error) > thaw_kernel_threads(); > return error; > @@ -222,8 +193,6 @@ void thaw_processes(void) > pm_freezing = false; > pm_nosig_freezing = false; > > - oom_killer_enable(); > - > printk("Restarting tasks ... "); > > __usermodehelper_set_disable_depth(UMH_FREEZING); > @@ -251,6 +220,9 @@ void thaw_kernel_threads(void) > { > struct task_struct *g, *p; > > + printk("Enabling OOM killer again.\n"); > + oom_killer_enable(); > + > pm_nosig_freezing = false; > printk("Restarting kernel threads ... "); > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 302e0fc6d121..34bcbb053132 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > current->memcg_oom.order = order; > } > > +extern bool oom_killer_disabled; > + > /** > * mem_cgroup_oom_synchronize - complete memcg OOM handling > * @handle: actually kill/wait or just clean up the OOM state > @@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle) > if (!memcg) > return false; > > - if (!handle) > + if (!handle || oom_killer_disabled) > goto cleanup; > > owait.memcg = memcg; > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 8b6e14136f4f..b3ccd92bc6dc 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -405,30 +405,63 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > } > > /* > - * Number of OOM killer invocations (including memcg OOM killer). > - * Primarily used by PM freezer to check for potential races with > - * OOM killed frozen task. > + * Number of OOM victims in flight > */ > -static atomic_t oom_kills = ATOMIC_INIT(0); > +static atomic_t oom_victims = ATOMIC_INIT(0); > +static DECLARE_COMPLETION(oom_victims_wait); > > -int oom_kills_count(void) > +bool oom_killer_disabled __read_mostly; > +static DECLARE_RWSEM(oom_sem); > + > +void mark_tsk_oom_victim(struct task_struct *tsk) > { > - return atomic_read(&oom_kills); > + BUG_ON(oom_killer_disabled); > + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > + return; > + atomic_inc(&oom_victims); > } > > -void note_oom_kill(void) > +void unmark_tsk_oom_victim(struct task_struct *tsk) > { > - atomic_inc(&oom_kills); > + int count; > + > + if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > + return; > + > + down_read(&oom_sem); > + /* > + * There is no need to signal the lasst oom_victim if there > + * is nobody who cares. > + */ > + if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) > + complete(&oom_victims_wait); > + up_read(&oom_sem); > } > > -void mark_tsk_oom_victim(struct task_struct *tsk) > +bool oom_killer_disable(void) > { > - set_tsk_thread_flag(tsk, TIF_MEMDIE); > + /* > + * Make sure to not race with an ongoing OOM killer > + * and that the current is not the victim. > + */ > + down_write(&oom_sem); > + if (!test_tsk_thread_flag(current, TIF_MEMDIE)) > + oom_killer_disabled = true; > + > + count = atomic_read(&oom_victims); > + up_write(&oom_sem); > + > + if (count && oom_killer_disabled) > + wait_for_completion(&oom_victims_wait); > + > + return oom_killer_disabled; > } > > -void unmark_tsk_oom_victim(struct task_struct *tsk) > +void oom_killer_enable(void) > { > - clear_thread_flag(TIF_MEMDIE); > + down_write(&oom_sem); > + oom_killer_disabled = false; > + up_write(&oom_sem); > } > > #define K(x) ((x) << (PAGE_SHIFT-10)) > @@ -626,7 +659,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) > } > > /** > - * out_of_memory - kill the "best" process when we run out of memory > + * __out_of_memory - kill the "best" process when we run out of memory > * @zonelist: zonelist pointer > * @gfp_mask: memory allocation flags > * @order: amount of memory being requested as a power of 2 > @@ -638,7 +671,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) > * OR try to be smart about which process to kill. Note that we > * don't have to be perfect here, we just have to be good. > */ > -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *nodemask, bool force_kill) > { > const nodemask_t *mpol_mask; > @@ -703,6 +736,31 @@ out: > schedule_timeout_killable(1); > } > > +/** out_of_memory - tries to invoke OOM killer. > + * @zonelist: zonelist pointer > + * @gfp_mask: memory allocation flags > + * @order: amount of memory being requested as a power of 2 > + * @nodemask: nodemask passed to page allocator > + * @force_kill: true if a task must be killed, even if others are exiting > + * > + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() > + * when it returns false. Otherwise returns true. > + */ > +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > + int order, nodemask_t *nodemask, bool force_kill) > +{ > + bool ret = false; > + > + down_read(&oom_sem); > + if (!oom_killer_disabled) { > + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); > + ret = true; > + } > + up_read(&oom_sem); > + > + return ret; > +} > + > /* > * The pagefault handler calls here because it is out of memory, so kill a > * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void) > { > struct zonelist *zonelist; > > + down_read(&oom_sem); > if (mem_cgroup_oom_synchronize(true)) > - return; > + goto unlock; > > zonelist = node_zonelist(first_memory_node, GFP_KERNEL); > if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { > - out_of_memory(NULL, 0, 0, NULL, false); > + if (!oom_killer_disabled) > + __out_of_memory(NULL, 0, 0, NULL, false); > oom_zonelist_unlock(zonelist, GFP_KERNEL); > } > +unlock: > + up_read(&oom_sem); > } > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 9cd36b822444..d44d69aa7b70 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype) > PB_migrate, PB_migrate_end); > } > > -bool oom_killer_disabled __read_mostly; > - > #ifdef CONFIG_DEBUG_VM > static int page_outside_zone_boundaries(struct zone *zone, struct page *page) > { > @@ -2241,10 +2239,11 @@ static inline struct page * > __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > struct zonelist *zonelist, enum zone_type high_zoneidx, > nodemask_t *nodemask, struct zone *preferred_zone, > - int classzone_idx, int migratetype) > + int classzone_idx, int migratetype, bool *oom_failed) > { > struct page *page; > > + *oom_failed = false; > /* Acquire the per-zone oom lock for each zone */ > if (!oom_zonelist_trylock(zonelist, gfp_mask)) { > schedule_timeout_uninterruptible(1); > @@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > } > > /* > - * PM-freezer should be notified that there might be an OOM killer on > - * its way to kill and wake somebody up. This is too early and we might > - * end up not killing anything but false positives are acceptable. > - * See freeze_processes. > - */ > - note_oom_kill(); > - > - /* > * Go through the zonelist yet one more time, keep very high watermark > * here, this is only to catch a parallel oom killing, we must fail if > * we're still under heavy pressure. > @@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > goto out; > } > /* Exhausted what can be done so it's blamo time */ > - out_of_memory(zonelist, gfp_mask, order, nodemask, false); > - > + if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false)) > + *oom_failed = true; > out: > oom_zonelist_unlock(zonelist, gfp_mask); > return page; > @@ -2716,8 +2707,8 @@ rebalance: > */ > if (!did_some_progress) { > if (oom_gfp_allowed(gfp_mask)) { > - if (oom_killer_disabled) > - goto nopage; > + bool oom_failed; > + > /* Coredumps can quickly deplete all memory reserves */ > if ((current->flags & PF_DUMPCORE) && > !(gfp_mask & __GFP_NOFAIL)) > @@ -2725,10 +2716,19 @@ rebalance: > page = __alloc_pages_may_oom(gfp_mask, order, > zonelist, high_zoneidx, > nodemask, preferred_zone, > - classzone_idx, migratetype); > + classzone_idx, migratetype, > + &oom_failed); > + > if (page) > goto got_pg; > > + /* > + * OOM killer might be disabled and then we have to > + * fail the allocation > + */ > + if (oom_failed) > + goto nopage; > + > if (!(gfp_mask & __GFP_NOFAIL)) { > /* > * The oom killer is not called for high-order > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless 2014-11-18 21:10 ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 2014-11-27 0:47 ` Rafael J. Wysocki @ 2014-12-02 22:08 ` Tejun Heo 2014-12-04 14:16 ` Michal Hocko 1 sibling, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-12-02 22:08 UTC (permalink / raw) To: Michal Hocko Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang Hello, sorry about the delay. Was on vacation. Generally looks good to me. Some comments below. > @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { > > static void moom_callback(struct work_struct *ignored) > { > - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, > - 0, NULL, true); > + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), > + GFP_KERNEL, 0, NULL, true)) { > + printk(KERN_INFO "OOM request ignored because killer is disabled\n"); > + } > } CodingStyle line 157 says "Do not unnecessarily use braces where a single statement will do.". > +/** > + * oom_killer_disable - disable OOM killer > + * > + * Forces all page allocations to fail rather than trigger OOM killer. > + * Will block and wait until all OOM victims are dead. > + * > + * Returns true if successfull and false if the OOM killer cannot be > + * disabled. > + */ > +extern bool oom_killer_disable(void); And function comments usually go where the function body is, not where the function is declared, no? > @@ -157,27 +132,11 @@ int freeze_processes(void) > pm_wakeup_clear(); > printk("Freezing user space processes ... "); > pm_freezing = true; > - oom_kills_saved = oom_kills_count(); > error = try_to_freeze_tasks(true); > if (!error) { > __usermodehelper_set_disable_depth(UMH_DISABLED); > - oom_killer_disable(); > - > - /* > - * There might have been an OOM kill while we were > - * freezing tasks and the killed task might be still > - * on the way out so we have to double check for race. > - */ > - if (oom_kills_count() != oom_kills_saved && > - !check_frozen_processes()) { > - __usermodehelper_set_disable_depth(UMH_ENABLED); > - printk("OOM in progress."); > - error = -EBUSY; > - } else { > - printk("done."); > - } > + printk("done.\n"); A delta but shouldn't it be pr_cont()? ... > @@ -206,6 +165,18 @@ int freeze_kernel_threads(void) > printk("\n"); > BUG_ON(in_atomic()); > > + /* > + * Now that everything freezable is handled we need to disbale > + * the OOM killer to disallow any further interference with > + * killable tasks. > + */ > + printk("Disabling OOM killer ... "); > + if (!oom_killer_disable()) { > + printk("failed.\n"); > + error = -EAGAIN; > + } else > + printk("done.\n"); Ditto on pr_cont() and CodingStyle line 169 says "This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both branches:" > @@ -251,6 +220,9 @@ void thaw_kernel_threads(void) > { > struct task_struct *g, *p; > > + printk("Enabling OOM killer again.\n"); Do we really need this printk? The same goes for Disabling OOM killer. For freezing it makes some sense because freezing may take a considerable amount of time and even occassionally fail due to timeout. We aren't really expecting those to happen for OOM victims. > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 302e0fc6d121..34bcbb053132 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > current->memcg_oom.order = order; > } > > +extern bool oom_killer_disabled; Ugh... don't we wanna put this in a header file? > +void mark_tsk_oom_victim(struct task_struct *tsk) > { > - return atomic_read(&oom_kills); > + BUG_ON(oom_killer_disabled); WARN_ON_ONCE() is prolly a better option here? > + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) Can a task actually be selected as an OOM victim multiple times? > + return; > + atomic_inc(&oom_victims); > } > > -void note_oom_kill(void) > +void unmark_tsk_oom_victim(struct task_struct *tsk) > { > - atomic_inc(&oom_kills); > + int count; > + > + if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > + return; Maybe test this inline in exit_mm()? e.g. if (test_thread_flag(TIF_MEMDIE)) unmark_tsk_oom_victim(current); Also, can the function ever be called by someone other than current? If not, why would it take @task? > + > + down_read(&oom_sem); > + /* > + * There is no need to signal the lasst oom_victim if there > + * is nobody who cares. > + */ > + if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) > + complete(&oom_victims_wait); I don't think using completion this way is safe. Please read on. > + up_read(&oom_sem); > } > > -void mark_tsk_oom_victim(struct task_struct *tsk) > +bool oom_killer_disable(void) > { > - set_tsk_thread_flag(tsk, TIF_MEMDIE); > + /* > + * Make sure to not race with an ongoing OOM killer > + * and that the current is not the victim. > + */ > + down_write(&oom_sem); > + if (!test_tsk_thread_flag(current, TIF_MEMDIE)) > + oom_killer_disabled = true; Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow. > + > + count = atomic_read(&oom_victims); > + up_write(&oom_sem); > + > + if (count && oom_killer_disabled) > + wait_for_completion(&oom_victims_wait); So, each complete() increments the done count and wait decs. The above code works iff the complete()'s and wait()'s are always balanced which usually isn't true in this type of wait code. Either use reinit_completion() / complete_all() combos or wait_event(). > + > + return oom_killer_disabled; Maybe 0 / -errno is better choice as return values? > +/** out_of_memory - tries to invoke OOM killer. Formatting? > + * @zonelist: zonelist pointer > + * @gfp_mask: memory allocation flags > + * @order: amount of memory being requested as a power of 2 > + * @nodemask: nodemask passed to page allocator > + * @force_kill: true if a task must be killed, even if others are exiting > + * > + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() > + * when it returns false. Otherwise returns true. > + */ > +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > + int order, nodemask_t *nodemask, bool force_kill) > +{ > + bool ret = false; > + > + down_read(&oom_sem); > + if (!oom_killer_disabled) { > + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); > + ret = true; > + } > + up_read(&oom_sem); > + > + return ret; Ditto on return value. 0 / -EBUSY seem like a better choice to me. > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void) > { > struct zonelist *zonelist; > > + down_read(&oom_sem); > if (mem_cgroup_oom_synchronize(true)) > - return; > + goto unlock; > > zonelist = node_zonelist(first_memory_node, GFP_KERNEL); > if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { > - out_of_memory(NULL, 0, 0, NULL, false); > + if (!oom_killer_disabled) > + __out_of_memory(NULL, 0, 0, NULL, false); > oom_zonelist_unlock(zonelist, GFP_KERNEL); Is this a condition which can happen and we can deal with? With userland fully frozen, there shouldn't be page faults which lead to memory allocation, right? Shouldn't we document how oom disable/enable is supposed to be used (it only makes sense while the whole system is in quiescent state) and at least trigger WARN_ON_ONCE() if the above code path gets triggered while oom killer is disabled? Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless 2014-12-02 22:08 ` Tejun Heo @ 2014-12-04 14:16 ` Michal Hocko 2014-12-04 14:44 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-04 14:16 UTC (permalink / raw) To: Tejun Heo Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang On Tue 02-12-14 17:08:04, Tejun Heo wrote: > Hello, sorry about the delay. Was on vacation. > > Generally looks good to me. Some comments below. > > > @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = { > > > > static void moom_callback(struct work_struct *ignored) > > { > > - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, > > - 0, NULL, true); > > + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), > > + GFP_KERNEL, 0, NULL, true)) { > > + printk(KERN_INFO "OOM request ignored because killer is disabled\n"); > > + } > > } > > CodingStyle line 157 says "Do not unnecessarily use braces where a > single statement will do.". Sure. Fixed > > +/** > > + * oom_killer_disable - disable OOM killer > > + * > > + * Forces all page allocations to fail rather than trigger OOM killer. > > + * Will block and wait until all OOM victims are dead. > > + * > > + * Returns true if successfull and false if the OOM killer cannot be > > + * disabled. > > + */ > > +extern bool oom_killer_disable(void); > > And function comments usually go where the function body is, not where > the function is declared, no? Fixed > > @@ -157,27 +132,11 @@ int freeze_processes(void) > > pm_wakeup_clear(); > > printk("Freezing user space processes ... "); > > pm_freezing = true; > > - oom_kills_saved = oom_kills_count(); > > error = try_to_freeze_tasks(true); > > if (!error) { > > __usermodehelper_set_disable_depth(UMH_DISABLED); > > - oom_killer_disable(); > > - > > - /* > > - * There might have been an OOM kill while we were > > - * freezing tasks and the killed task might be still > > - * on the way out so we have to double check for race. > > - */ > > - if (oom_kills_count() != oom_kills_saved && > > - !check_frozen_processes()) { > > - __usermodehelper_set_disable_depth(UMH_ENABLED); > > - printk("OOM in progress."); > > - error = -EBUSY; > > - } else { > > - printk("done."); > > - } > > + printk("done.\n"); > > A delta but shouldn't it be pr_cont()? kernel/power/process.c doesn't use pr_* so I've stayed with what the rest of the file is using. I can add a patch which transforms all of them. > ... > > @@ -206,6 +165,18 @@ int freeze_kernel_threads(void) > > printk("\n"); > > BUG_ON(in_atomic()); > > > > + /* > > + * Now that everything freezable is handled we need to disbale > > + * the OOM killer to disallow any further interference with > > + * killable tasks. > > + */ > > + printk("Disabling OOM killer ... "); > > + if (!oom_killer_disable()) { > > + printk("failed.\n"); > > + error = -EAGAIN; > > + } else > > + printk("done.\n"); > > Ditto on pr_cont() and > > CodingStyle line 169 says "This does not apply > if only one branch of a conditional statement is a single statement; > in the latter case use braces in both branches:" Fixed > > @@ -251,6 +220,9 @@ void thaw_kernel_threads(void) > > { > > struct task_struct *g, *p; > > > > + printk("Enabling OOM killer again.\n"); > > Do we really need this printk? The same goes for Disabling OOM > killer. For freezing it makes some sense because freezing may take a > considerable amount of time and even occassionally fail due to > timeout. We aren't really expecting those to happen for OOM victims. I just considered them useful if there are follow up allocation failure messages to know that they are due to OOM killer. I can remove them. > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 302e0fc6d121..34bcbb053132 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > current->memcg_oom.order = order; > > } > > > > +extern bool oom_killer_disabled; > > Ugh... don't we wanna put this in a header file? Who else would need the declaration? This is not something random code should look at. > > +void mark_tsk_oom_victim(struct task_struct *tsk) > > { > > - return atomic_read(&oom_kills); > > + BUG_ON(oom_killer_disabled); > > WARN_ON_ONCE() is prolly a better option here? Well, something fishy is going on when oom_killer_disabled is set and we mark new OOM victim. This is a clear bug. Why would be warning and a allow the follow up breakage? > > + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > > Can a task actually be selected as an OOM victim multiple times? AFAICS nothing prevents from global OOM and memcg OOM killers racing. > > + return; > > + atomic_inc(&oom_victims); > > } > > > > -void note_oom_kill(void) > > +void unmark_tsk_oom_victim(struct task_struct *tsk) > > { > > - atomic_inc(&oom_kills); > > + int count; > > + > > + if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > > + return; > > Maybe test this inline in exit_mm()? e.g. > > if (test_thread_flag(TIF_MEMDIE)) > unmark_tsk_oom_victim(current); Why do you think testing TIF_MEMDIE in exit_mm is better? I would like to reduce the usage of the flag as much as possible. > Also, can the function ever be called by someone other than current? > If not, why would it take @task? Changed to use current only. If there is anybody who needs that we can change that later. I wanted to have it symmetric to mark_tsk_oom_victim but that is not that important. > > + > > + down_read(&oom_sem); > > + /* > > + * There is no need to signal the lasst oom_victim if there > > + * is nobody who cares. > > + */ > > + if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) > > + complete(&oom_victims_wait); > > I don't think using completion this way is safe. Please read on. > > > + up_read(&oom_sem); > > } > > > > -void mark_tsk_oom_victim(struct task_struct *tsk) > > +bool oom_killer_disable(void) > > { > > - set_tsk_thread_flag(tsk, TIF_MEMDIE); > > + /* > > + * Make sure to not race with an ongoing OOM killer > > + * and that the current is not the victim. > > + */ > > + down_write(&oom_sem); > > + if (!test_tsk_thread_flag(current, TIF_MEMDIE)) > > + oom_killer_disabled = true; > > Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow. OK > > + > > + count = atomic_read(&oom_victims); > > + up_write(&oom_sem); > > + > > + if (count && oom_killer_disabled) > > + wait_for_completion(&oom_victims_wait); > > So, each complete() increments the done count and wait decs. The > above code works iff the complete()'s and wait()'s are always balanced > which usually isn't true in this type of wait code. Either use > reinit_completion() / complete_all() combos or wait_event(). Hmm, I thought that only a single instance of freeze_kernel_threads (which calls oom_killer_disable) can run at a time. But I am currently not sure that all paths are called under lock_system_sleep. I am not familiar with reinit_completion API. Is the following correct? [...] @@ -434,10 +434,23 @@ void unmark_tsk_oom_victim(struct task_struct *tsk) * is nobody who cares. */ if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) - complete(&oom_victims_wait); + complete_all(&oom_victims_wait); up_read(&oom_sem); } [...] @@ -445,16 +458,23 @@ bool oom_killer_disable(void) * and that the current is not the victim. */ down_write(&oom_sem); - if (!test_tsk_thread_flag(current, TIF_MEMDIE)) - oom_killer_disabled = true; + if (test_thread_flag(TIF_MEMDIE)) { + up_write(&oom_sem); + return false; + } + + /* unmark_tsk_oom_victim is calling complete_all */ + if (!oom_killer_disable) + reinit_completion(&oom_victims_wait); + oom_killer_disabled = true; count = atomic_read(&oom_victims); up_write(&oom_sem); - if (count && oom_killer_disabled) + if (count) wait_for_completion(&oom_victims_wait); - return oom_killer_disabled; + return true; } > > + > > + return oom_killer_disabled; > > Maybe 0 / -errno is better choice as return values? I do not have problem to change this if you feel strong about it but true/false sounds easier to me and it allows the caller to decide what to report. If there were multiple reasons to fail then sure but that is not the case. > > +/** out_of_memory - tries to invoke OOM killer. > > Formatting? fixed > > + * @zonelist: zonelist pointer > > + * @gfp_mask: memory allocation flags > > + * @order: amount of memory being requested as a power of 2 > > + * @nodemask: nodemask passed to page allocator > > + * @force_kill: true if a task must be killed, even if others are exiting > > + * > > + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() > > + * when it returns false. Otherwise returns true. > > + */ > > +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > > + int order, nodemask_t *nodemask, bool force_kill) > > +{ > > + bool ret = false; > > + > > + down_read(&oom_sem); > > + if (!oom_killer_disabled) { > > + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); > > + ret = true; > > + } > > + up_read(&oom_sem); > > + > > + return ret; > > Ditto on return value. 0 / -EBUSY seem like a better choice to me. > > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void) > > { > > struct zonelist *zonelist; > > > > + down_read(&oom_sem); > > if (mem_cgroup_oom_synchronize(true)) > > - return; > > + goto unlock; > > > > zonelist = node_zonelist(first_memory_node, GFP_KERNEL); > > if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { > > - out_of_memory(NULL, 0, 0, NULL, false); > > + if (!oom_killer_disabled) > > + __out_of_memory(NULL, 0, 0, NULL, false); > > oom_zonelist_unlock(zonelist, GFP_KERNEL); > > Is this a condition which can happen and we can deal with? With > userland fully frozen, there shouldn't be page faults which lead to > memory allocation, right? Except for racing OOM victims which were missed by try_to_freeze_tasks because they didn't get cpu slice to wake up from the freezer. The task would die on the way out from the page fault exception. I have updated the changelog to be more verbose about this. > Shouldn't we document how oom disable/enable is supposed to be used Well the API shouldn't be used outside of the PM freezer IMO. This is not a general API that other part of the kernel should be using. I can surely add more documentation for the PM usage though. I have rewritten the changelog: " As oom_killer_disable() is a full OOM barrier now we can postpone it in the PM freezer to later after all freezable user tasks are considered frozen (to freeze_kernel_threads). Normally there wouldn't be any unfrozen user tasks at this moment so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further\ allocation would fail - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet " And I've added: +/** + * oom_killer_disable - disable OOM killer + * + * Forces all page allocations to fail rather than trigger OOM killer. + * Will block and wait until all OOM victims are dead. + * + * The function cannot be called when there are runnable user tasks because + * the userspace would see unexpected allocation failures as a result. Any + * new usage of this function should be consulted with MM people. + * + * Returns true if successful and false if the OOM killer cannot be + * disabled. + */ bool oom_killer_disable(void) > (it only makes sense while the whole system is in quiescent state) > and at least trigger WARN_ON_ONCE() if the above code path gets > triggered while oom killer is disabled? I can add a WARN_ON(!test_thread_flag(tsk, TIF_MEMDIE)). Thanks for the review! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless 2014-12-04 14:16 ` Michal Hocko @ 2014-12-04 14:44 ` Tejun Heo 2014-12-04 16:56 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-12-04 14:44 UTC (permalink / raw) To: Michal Hocko Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang On Thu, Dec 04, 2014 at 03:16:23PM +0100, Michal Hocko wrote: > > A delta but shouldn't it be pr_cont()? > > kernel/power/process.c doesn't use pr_* so I've stayed with what the > rest of the file is using. I can add a patch which transforms all of > them. The console output becomes wrong when printk() is used on continuation. So, yeah, it'd be great to fix it. > > > +extern bool oom_killer_disabled; > > > > Ugh... don't we wanna put this in a header file? > > Who else would need the declaration? This is not something random code > should look at. Let's say, somebody changes the type to ulong for whatever reason later and forgets to update this declaration. What happens then on a big endian machine? Jesus, this is basic C programming. You don't sprinkle external declarations which the compiler can't verify against the actual definitions. There's absolutely no compelling reason to do that here. Why would you take out compiler verification for no reason? > > > +void mark_tsk_oom_victim(struct task_struct *tsk) > > > { > > > - return atomic_read(&oom_kills); > > > + BUG_ON(oom_killer_disabled); > > > > WARN_ON_ONCE() is prolly a better option here? > > Well, something fishy is going on when oom_killer_disabled is set and we > mark new OOM victim. This is a clear bug. Why would be warning and a > allow the follow up breakage? Because the system is more likely to be able to go on and we don't BUG when we can WARN as a general rule. Working systems is almost always better than a dead system even for debugging. > > > + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > > > > Can a task actually be selected as an OOM victim multiple times? > > AFAICS nothing prevents from global OOM and memcg OOM killers racing. Maybe it'd be a good idea to note that in the comment? > > > -void note_oom_kill(void) > > > +void unmark_tsk_oom_victim(struct task_struct *tsk) > > > { > > > - atomic_inc(&oom_kills); > > > + int count; > > > + > > > + if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > > > + return; > > > > Maybe test this inline in exit_mm()? e.g. > > > > if (test_thread_flag(TIF_MEMDIE)) > > unmark_tsk_oom_victim(current); > > Why do you think testing TIF_MEMDIE in exit_mm is better? I would like > to reduce the usage of the flag as much as possible. Because it's adding a function call/return to hot path for everybody. It sure is a miniscule cost but we're adding that for no good reason. > > So, each complete() increments the done count and wait decs. The > > above code works iff the complete()'s and wait()'s are always balanced > > which usually isn't true in this type of wait code. Either use > > reinit_completion() / complete_all() combos or wait_event(). > > Hmm, I thought that only a single instance of freeze_kernel_threads > (which calls oom_killer_disable) can run at a time. But I am currently > not sure that all paths are called under lock_system_sleep. > I am not familiar with reinit_completion API. Is the following correct? Hmmm... wouldn't wait_event() easier to read in this case? ... > > Maybe 0 / -errno is better choice as return values? > > I do not have problem to change this if you feel strong about it but > true/false sounds easier to me and it allows the caller to decide what to > report. If there were multiple reasons to fail then sure but that is not > the case. It's not a big deal but except for functions which have clear boolean behavior - functions which try/attempt something or query or decide certain things - randomly thrown in bool returns tend to become confusing especially because its bool fail value is the opposite of 0/-errno fail value. So, "this function only fails with one reason" is usually a bad and arbitrary reason for choosing bool return which causes confusion on callsites and headaches when the function develops more reasons to fail. ... > > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void) > > > { > > > struct zonelist *zonelist; > > > > > > + down_read(&oom_sem); > > > if (mem_cgroup_oom_synchronize(true)) > > > - return; > > > + goto unlock; > > > > > > zonelist = node_zonelist(first_memory_node, GFP_KERNEL); > > > if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { > > > - out_of_memory(NULL, 0, 0, NULL, false); > > > + if (!oom_killer_disabled) > > > + __out_of_memory(NULL, 0, 0, NULL, false); > > > oom_zonelist_unlock(zonelist, GFP_KERNEL); > > > > Is this a condition which can happen and we can deal with? With > > userland fully frozen, there shouldn't be page faults which lead to > > memory allocation, right? > > Except for racing OOM victims which were missed by try_to_freeze_tasks > because they didn't get cpu slice to wake up from the freezer. The task > would die on the way out from the page fault exception. I have updated > the changelog to be more verbose about this. That's something very not obvious. Let's please add a comment explaining that. > > (it only makes sense while the whole system is in quiescent state) > > and at least trigger WARN_ON_ONCE() if the above code path gets > > triggered while oom killer is disabled? > > I can add a WARN_ON(!test_thread_flag(tsk, TIF_MEMDIE)). Yeah, that makes sense to me. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless 2014-12-04 14:44 ` Tejun Heo @ 2014-12-04 16:56 ` Michal Hocko 2014-12-04 17:18 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-04 16:56 UTC (permalink / raw) To: Tejun Heo Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang On Thu 04-12-14 09:44:54, Tejun Heo wrote: > On Thu, Dec 04, 2014 at 03:16:23PM +0100, Michal Hocko wrote: > > > A delta but shouldn't it be pr_cont()? > > > > kernel/power/process.c doesn't use pr_* so I've stayed with what the > > rest of the file is using. I can add a patch which transforms all of > > them. > > The console output becomes wrong when printk() is used on > continuation. So, yeah, it'd be great to fix it. > > > > > +extern bool oom_killer_disabled; > > > > > > Ugh... don't we wanna put this in a header file? > > > > Who else would need the declaration? This is not something random code > > should look at. > > Let's say, somebody changes the type to ulong for whatever reason > later and forgets to update this declaration. What happens then on a > big endian machine? OK, see your point. Although this is unlikely... > Jesus, this is basic C programming. You don't sprinkle external > declarations which the compiler can't verify against the actual > definitions. There's absolutely no compelling reason to do that here. > Why would you take out compiler verification for no reason? > > > > > +void mark_tsk_oom_victim(struct task_struct *tsk) > > > > { > > > > - return atomic_read(&oom_kills); > > > > + BUG_ON(oom_killer_disabled); > > > > > > WARN_ON_ONCE() is prolly a better option here? > > > > Well, something fishy is going on when oom_killer_disabled is set and we > > mark new OOM victim. This is a clear bug. Why would be warning and a > > allow the follow up breakage? > > Because the system is more likely to be able to go on and we don't BUG > when we can WARN as a general rule. Working systems is almost always > better than a dead system even for debugging. > > > > > + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > > > > > > Can a task actually be selected as an OOM victim multiple times? > > > > AFAICS nothing prevents from global OOM and memcg OOM killers racing. > > Maybe it'd be a good idea to note that in the comment? ok > > > > -void note_oom_kill(void) > > > > +void unmark_tsk_oom_victim(struct task_struct *tsk) > > > > { > > > > - atomic_inc(&oom_kills); > > > > + int count; > > > > + > > > > + if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) > > > > + return; > > > > > > Maybe test this inline in exit_mm()? e.g. > > > > > > if (test_thread_flag(TIF_MEMDIE)) > > > unmark_tsk_oom_victim(current); > > > > Why do you think testing TIF_MEMDIE in exit_mm is better? I would like > > to reduce the usage of the flag as much as possible. > > Because it's adding a function call/return to hot path for everybody. > It sure is a miniscule cost but we're adding that for no good reason. ok. > > > So, each complete() increments the done count and wait decs. The > > > above code works iff the complete()'s and wait()'s are always balanced > > > which usually isn't true in this type of wait code. Either use > > > reinit_completion() / complete_all() combos or wait_event(). > > > > Hmm, I thought that only a single instance of freeze_kernel_threads > > (which calls oom_killer_disable) can run at a time. But I am currently > > not sure that all paths are called under lock_system_sleep. > > I am not familiar with reinit_completion API. Is the following correct? > > Hmmm... wouldn't wait_event() easier to read in this case? OK, it looks easier. I thought it would require some additional synchronization between wake up and wait but everything necessary seems to be done in wait_event already so we cannot miss a wake up AFAICS: diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 1d55ab12792f..032be9d2a239 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -408,7 +408,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, * Number of OOM victims in flight */ static atomic_t oom_victims = ATOMIC_INIT(0); -static DECLARE_COMPLETION(oom_victims_wait); +static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait); bool oom_killer_disabled __read_mostly; static DECLARE_RWSEM(oom_sem); @@ -435,7 +435,7 @@ void unmark_tsk_oom_victim(void) * is nobody who cares. */ if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) - complete_all(&oom_victims_wait); + wake_up_all(&oom_victims_wait); up_read(&oom_sem); } @@ -464,16 +464,11 @@ bool oom_killer_disable(void) return false; } - /* unmark_tsk_oom_victim is calling complete_all */ - if (!oom_killer_disable) - reinit_completion(&oom_victims_wait); - oom_killer_disabled = true; - count = atomic_read(&oom_victims); up_write(&oom_sem); if (count) - wait_for_completion(&oom_victims_wait); + wait_event(oom_victims_wait, atomic_read(&oom_victims)); return true; } > ... > > > Maybe 0 / -errno is better choice as return values? > > > > I do not have problem to change this if you feel strong about it but > > true/false sounds easier to me and it allows the caller to decide what to > > report. If there were multiple reasons to fail then sure but that is not > > the case. > > It's not a big deal but except for functions which have clear boolean > behavior - functions which try/attempt something or query or decide this is basically try_lock which might fail due to whatever internal reasons. > certain things - randomly thrown in bool returns tend to become > confusing especially because its bool fail value is the opposite of > 0/-errno fail value. So, "this function only fails with one reason" > is usually a bad and arbitrary reason for choosing bool return which > causes confusion on callsites and headaches when the function develops > more reasons to fail. > > ... > > > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void) > > > > { > > > > struct zonelist *zonelist; > > > > > > > > + down_read(&oom_sem); > > > > if (mem_cgroup_oom_synchronize(true)) > > > > - return; > > > > + goto unlock; > > > > > > > > zonelist = node_zonelist(first_memory_node, GFP_KERNEL); > > > > if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { > > > > - out_of_memory(NULL, 0, 0, NULL, false); > > > > + if (!oom_killer_disabled) > > > > + __out_of_memory(NULL, 0, 0, NULL, false); > > > > oom_zonelist_unlock(zonelist, GFP_KERNEL); > > > > > > Is this a condition which can happen and we can deal with? With > > > userland fully frozen, there shouldn't be page faults which lead to > > > memory allocation, right? > > > > Except for racing OOM victims which were missed by try_to_freeze_tasks > > because they didn't get cpu slice to wake up from the freezer. The task > > would die on the way out from the page fault exception. I have updated > > the changelog to be more verbose about this. > > That's something very not obvious. Let's please add a comment > explaining that. @@ -778,6 +795,15 @@ void pagefault_out_of_memory(void) if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { if (!oom_killer_disabled) __out_of_memory(NULL, 0, 0, NULL, false); + else + /* + * There shouldn't be any user tasks runable while the + * OOM killer is disabled so the current task has to + * be a racing OOM victim for which oom_killer_disable() + * is waiting for. + */ + WARN_ON(test_thread_flag(TIF_MEMDIE)); + oom_zonelist_unlock(zonelist, GFP_KERNEL); } unlock: > > > > (it only makes sense while the whole system is in quiescent state) > > > and at least trigger WARN_ON_ONCE() if the above code path gets > > > triggered while oom killer is disabled? > > > > I can add a WARN_ON(!test_thread_flag(tsk, TIF_MEMDIE)). > > Yeah, that makes sense to me. > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless 2014-12-04 16:56 ` Michal Hocko @ 2014-12-04 17:18 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-04 17:18 UTC (permalink / raw) To: Tejun Heo Cc: LKML, linux-mm, linux-pm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Oleg Nesterov, Cong Wang On Thu 04-12-14 17:56:01, Michal Hocko wrote: > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 1d55ab12792f..032be9d2a239 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -408,7 +408,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > * Number of OOM victims in flight > */ > static atomic_t oom_victims = ATOMIC_INIT(0); > -static DECLARE_COMPLETION(oom_victims_wait); > +static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait); > > bool oom_killer_disabled __read_mostly; > static DECLARE_RWSEM(oom_sem); > @@ -435,7 +435,7 @@ void unmark_tsk_oom_victim(void) > * is nobody who cares. > */ > if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) > - complete_all(&oom_victims_wait); > + wake_up_all(&oom_victims_wait); > up_read(&oom_sem); > } > > @@ -464,16 +464,11 @@ bool oom_killer_disable(void) > return false; > } > > - /* unmark_tsk_oom_victim is calling complete_all */ > - if (!oom_killer_disable) > - reinit_completion(&oom_victims_wait); > - > oom_killer_disabled = true; > - count = atomic_read(&oom_victims); > up_write(&oom_sem); > > if (count) whithout this count test obviously > - wait_for_completion(&oom_victims_wait); > + wait_event(oom_victims_wait, atomic_read(&oom_victims)); > > return true; > } -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH 0/4] OOM vs PM freezer fixes 2014-11-10 16:30 ` Michal Hocko 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko @ 2014-12-05 16:41 ` Michal Hocko 2014-12-05 16:41 ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko ` (5 more replies) 1 sibling, 6 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm Hi, here is another take at OOM vs. PM freezer interaction fixes/cleanups. First three patches are fixes for an unlikely cases when OOM races with the PM freezer which should be closed completely finally. The last patch is a simple code enhancement which is not needed strictly speaking but it is nice to have IMO. Both OOM killer and PM freezer are quite subtle so I hope I haven't missing anything. Any feedback is highly appreciated. I am also interested about feedback for the used approach. To be honest I am not really happy about spreading TIF_MEMDIE checks into freezer (patch 1) but I didn't find any other way for detecting OOM killed tasks. Changes are based on top of Linus tree (3.18-rc3). Michal Hocko (4): OOM, PM: Do not miss OOM killed frozen tasks OOM, PM: make OOM detection in the freezer path raceless OOM, PM: handle pm freezer as an OOM victim correctly OOM: thaw the OOM victim if it is frozen Diffstat says: drivers/tty/sysrq.c | 6 ++-- include/linux/oom.h | 39 ++++++++++++++++------ kernel/freezer.c | 15 +++++++-- kernel/power/process.c | 60 +++++++++------------------------- mm/memcontrol.c | 4 ++- mm/oom_kill.c | 89 ++++++++++++++++++++++++++++++++++++++------------ mm/page_alloc.c | 32 +++++++++--------- 7 files changed, 147 insertions(+), 98 deletions(-) ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko @ 2014-12-05 16:41 ` Michal Hocko 2014-12-06 12:56 ` Tejun Heo 2015-01-07 17:57 ` Tejun Heo 2014-12-05 16:41 ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko ` (4 subsequent siblings) 5 siblings, 2 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm This patch is just a preparatory and it doesn't introduce any functional change. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/oom.h | 4 ++++ kernel/exit.c | 2 +- mm/memcontrol.c | 2 +- mm/oom_kill.c | 23 ++++++++++++++++++++--- 4 files changed, 26 insertions(+), 5 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 4971874f54db..1315fcbb9527 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -47,6 +47,10 @@ static inline bool oom_task_origin(const struct task_struct *p) return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN); } +extern void mark_tsk_oom_victim(struct task_struct *tsk); + +extern void unmark_tsk_oom_victim(void); + extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); diff --git a/kernel/exit.c b/kernel/exit.c index 5d30019ff953..ee5176e2a1ba 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -459,7 +459,7 @@ static void exit_mm(struct task_struct *tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); - clear_thread_flag(TIF_MEMDIE); + unmark_tsk_oom_victim(); } /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d6ac0e33e150..302e0fc6d121 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1735,7 +1735,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * quickly exit and free its memory. */ if (fatal_signal_pending(current) || current->flags & PF_EXITING) { - set_thread_flag(TIF_MEMDIE); + mark_tsk_oom_victim(current); return; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5340f6b91312..c75b37d59a32 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -421,6 +421,23 @@ void note_oom_kill(void) atomic_inc(&oom_kills); } +/** + * Marks the given taks as OOM victim. + * @tsk: task to mark + */ +void mark_tsk_oom_victim(struct task_struct *tsk) +{ + set_tsk_thread_flag(tsk, TIF_MEMDIE); +} + +/** + * Unmarks the current task as OOM victim. + */ +void unmark_tsk_oom_victim(void) +{ + clear_thread_flag(TIF_MEMDIE); +} + #define K(x) ((x) << (PAGE_SHIFT-10)) /* * Must be called while holding a reference to p, which will be released upon @@ -444,7 +461,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { - set_tsk_thread_flag(p, TIF_MEMDIE); + mark_tsk_oom_victim(p); put_task_struct(p); return; } @@ -527,7 +544,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } rcu_read_unlock(); - set_tsk_thread_flag(victim, TIF_MEMDIE); + mark_tsk_oom_victim(victim); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); put_task_struct(victim); } @@ -650,7 +667,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * quickly exit and free its memory. */ if (fatal_signal_pending(current) || current->flags & PF_EXITING) { - set_thread_flag(TIF_MEMDIE); + mark_tsk_oom_victim(current); return; } -- 2.1.3 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE 2014-12-05 16:41 ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko @ 2014-12-06 12:56 ` Tejun Heo 2014-12-07 10:13 ` Michal Hocko 2015-01-07 17:57 ` Tejun Heo 1 sibling, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-12-06 12:56 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote: > +/** > + * Marks the given taks as OOM victim. /** * $FUNCTION_NAME - $DESCRIPTION > + * @tsk: task to mark > + */ > +void mark_tsk_oom_victim(struct task_struct *tsk) > +{ > + set_tsk_thread_flag(tsk, TIF_MEMDIE); > +} > + > +/** > + * Unmarks the current task as OOM victim. Ditto. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE 2014-12-06 12:56 ` Tejun Heo @ 2014-12-07 10:13 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-07 10:13 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sat 06-12-14 07:56:17, Tejun Heo wrote: > On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote: > > +/** > > + * Marks the given taks as OOM victim. > > /** > * $FUNCTION_NAME - $DESCRIPTION > > > + * @tsk: task to mark > > + */ > > +void mark_tsk_oom_victim(struct task_struct *tsk) > > +{ > > + set_tsk_thread_flag(tsk, TIF_MEMDIE); > > +} > > + > > +/** > > + * Unmarks the current task as OOM victim. > > Ditto. Fixed -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE 2014-12-05 16:41 ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko 2014-12-06 12:56 ` Tejun Heo @ 2015-01-07 17:57 ` Tejun Heo 2015-01-07 18:23 ` Michal Hocko 1 sibling, 1 reply; 93+ messages in thread From: Tejun Heo @ 2015-01-07 17:57 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote: > +/** > + * Unmarks the current task as OOM victim. > + */ > +void unmark_tsk_oom_victim(void) > +{ > + clear_thread_flag(TIF_MEMDIE); > +} This prolly should be unmark_current_oom_victim()? Also, can we please use full "task" at least in global symbols? I don't think tsk abbreviation is that popular in function names. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE 2015-01-07 17:57 ` Tejun Heo @ 2015-01-07 18:23 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2015-01-07 18:23 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Wed 07-01-15 12:57:31, Tejun Heo wrote: > On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote: > > +/** > > + * Unmarks the current task as OOM victim. > > + */ > > +void unmark_tsk_oom_victim(void) > > +{ > > + clear_thread_flag(TIF_MEMDIE); > > +} > > This prolly should be unmark_current_oom_victim()? OK. > Also, can we > please use full "task" at least in global symbols? I don't think tsk > abbreviation is that popular in function names. It is mimicking *_tsk_thread_flag() API. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 2014-12-05 16:41 ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko @ 2014-12-05 16:41 ` Michal Hocko 2014-12-06 13:06 ` Tejun Heo 2014-12-05 16:41 ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko ` (3 subsequent siblings) 5 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add the frozen check and thaw the task right before we send SIGKILL to the victim. The check and thawing in oom_scan_process_thread has to stay because the task might got access to memory reserves even without an explicit SIGKILL from oom_kill_process (e.g. it already has fatal signal pending or it is exiting already). Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/oom_kill.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index c75b37d59a32..8874058d62db 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -545,6 +545,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, rcu_read_unlock(); mark_tsk_oom_victim(victim); + if (frozen(victim)) + __thaw_task(victim); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); put_task_struct(victim); } -- 2.1.3 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen 2014-12-05 16:41 ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko @ 2014-12-06 13:06 ` Tejun Heo 2014-12-07 10:24 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-12-06 13:06 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm Hello, On Fri, Dec 05, 2014 at 05:41:44PM +0100, Michal Hocko wrote: > oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the > victim. This is basically noop when the task is frozen though because > the task sleeps in uninterruptible sleep. The victim is eventually > thawed later when oom_scan_process_thread meets the task again in a > later OOM invocation so the OOM killer doesn't live lock. But this is > less than optimal. Let's add the frozen check and thaw the task right > before we send SIGKILL to the victim. > > The check and thawing in oom_scan_process_thread has to stay because the > task might got access to memory reserves even without an explicit > SIGKILL from oom_kill_process (e.g. it already has fatal signal pending > or it is exiting already). How else would a task get TIF_MEMDIE? If there are other paths which set TIF_MEMDIE, the right thing to do is creating a function which thaws / wakes up the target task and use it there too. Please interlock these things properly from the get-go instead of scattering these things around. > @@ -545,6 +545,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > rcu_read_unlock(); > > mark_tsk_oom_victim(victim); > + if (frozen(victim)) > + __thaw_task(victim); The frozen() test here is racy. Always calling __thaw_task() wouldn't be. You can argue that being racy here is okay because the later scanning would find it but why complicate things like that? Just properly interlock each instance and be done with it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen 2014-12-06 13:06 ` Tejun Heo @ 2014-12-07 10:24 ` Michal Hocko 2014-12-07 10:45 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-07 10:24 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sat 06-12-14 08:06:57, Tejun Heo wrote: > Hello, > > On Fri, Dec 05, 2014 at 05:41:44PM +0100, Michal Hocko wrote: > > oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the > > victim. This is basically noop when the task is frozen though because > > the task sleeps in uninterruptible sleep. The victim is eventually > > thawed later when oom_scan_process_thread meets the task again in a > > later OOM invocation so the OOM killer doesn't live lock. But this is > > less than optimal. Let's add the frozen check and thaw the task right > > before we send SIGKILL to the victim. > > > > The check and thawing in oom_scan_process_thread has to stay because the > > task might got access to memory reserves even without an explicit > > SIGKILL from oom_kill_process (e.g. it already has fatal signal pending > > or it is exiting already). > > How else would a task get TIF_MEMDIE? If there are other paths which > set TIF_MEMDIE, the right thing to do is creating a function which > thaws / wakes up the target task and use it there too. Please > interlock these things properly from the get-go instead of scattering > these things around. See __out_of_memory which sets TIF_MEMDIE on current when it is exiting or has fatal signals pending. This task cannot be frozen obviously. > > @@ -545,6 +545,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > rcu_read_unlock(); > > > > mark_tsk_oom_victim(victim); > > + if (frozen(victim)) > > + __thaw_task(victim); > > The frozen() test here is racy. Always calling __thaw_task() wouldn't > be. You can argue that being racy here is okay because the later > scanning would find it but why complicate things like that? Just > properly interlock each instance and be done with it. OK, changed. I didn't realize that __thaw_task does the check already and was following what we have in oom_scan_process_thread. Removed the check from that one as well. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen 2014-12-07 10:24 ` Michal Hocko @ 2014-12-07 10:45 ` Michal Hocko 2014-12-07 13:59 ` Tejun Heo 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-07 10:45 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sun 07-12-14 11:24:30, Michal Hocko wrote: > On Sat 06-12-14 08:06:57, Tejun Heo wrote: > > Hello, > > > > On Fri, Dec 05, 2014 at 05:41:44PM +0100, Michal Hocko wrote: > > > oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the > > > victim. This is basically noop when the task is frozen though because > > > the task sleeps in uninterruptible sleep. The victim is eventually > > > thawed later when oom_scan_process_thread meets the task again in a > > > later OOM invocation so the OOM killer doesn't live lock. But this is > > > less than optimal. Let's add the frozen check and thaw the task right > > > before we send SIGKILL to the victim. > > > > > > The check and thawing in oom_scan_process_thread has to stay because the > > > task might got access to memory reserves even without an explicit > > > SIGKILL from oom_kill_process (e.g. it already has fatal signal pending > > > or it is exiting already). > > > > How else would a task get TIF_MEMDIE? If there are other paths which > > set TIF_MEMDIE, the right thing to do is creating a function which > > thaws / wakes up the target task and use it there too. Please > > interlock these things properly from the get-go instead of scattering > > these things around. > > See __out_of_memory which sets TIF_MEMDIE on current when it is exiting > or has fatal signals pending. This task cannot be frozen obviously. On the other hand we are doing the same early in oom_kill_process which doesn't work on the current. I've moved the __thaw_task into mark_tsk_oom_victim so it catches all instances now. oom_scan_process_thread doesn't need to thaw anymore. --- >From af8222df6c503fa1beab8279ff39a282fd90698b Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Wed, 12 Nov 2014 18:56:54 +0100 Subject: [PATCH] OOM: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/oom_kill.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 56eab9621c3a..19a08f3f00ba 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -266,8 +266,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, * Don't allow any other task to have access to the reserves. */ if (test_tsk_thread_flag(task, TIF_MEMDIE)) { - if (unlikely(frozen(task))) - __thaw_task(task); if (!force_kill) return OOM_SCAN_ABORT; } @@ -428,6 +426,7 @@ void note_oom_kill(void) void mark_tsk_oom_victim(struct task_struct *tsk) { set_tsk_thread_flag(tsk, TIF_MEMDIE); + __thaw_task(tsk); } /** -- 2.1.3 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen 2014-12-07 10:45 ` Michal Hocko @ 2014-12-07 13:59 ` Tejun Heo 2014-12-07 18:55 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-12-07 13:59 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sun, Dec 07, 2014 at 11:45:39AM +0100, Michal Hocko wrote: .... > void mark_tsk_oom_victim(struct task_struct *tsk) > { > set_tsk_thread_flag(tsk, TIF_MEMDIE); > + __thaw_task(tsk); Yeah, this is a lot better. Maybe we can add a comment at least pointing readers to where to look at to understand what's going on? This stems from the fact that OOM killer which essentially is a memory reclaim operation overrides freezing. It'd be nice if that is documented somehow. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen 2014-12-07 13:59 ` Tejun Heo @ 2014-12-07 18:55 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-07 18:55 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sun 07-12-14 08:59:40, Tejun Heo wrote: > On Sun, Dec 07, 2014 at 11:45:39AM +0100, Michal Hocko wrote: > .... > > void mark_tsk_oom_victim(struct task_struct *tsk) > > { > > set_tsk_thread_flag(tsk, TIF_MEMDIE); > > + __thaw_task(tsk); > > Yeah, this is a lot better. Maybe we can add a comment at least > pointing readers to where to look at to understand what's going on? > This stems from the fact that OOM killer which essentially is a memory > reclaim operation overrides freezing. It'd be nice if that is > documented somehow. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 19a08f3f00ba..fca456fe855a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -426,6 +426,13 @@ void note_oom_kill(void) void mark_tsk_oom_victim(struct task_struct *tsk) { set_tsk_thread_flag(tsk, TIF_MEMDIE); + + /* + * Make sure that the task is woken up from uninterruptible sleep + * if it is frozen because OOM killer wouldn't be able to free + * any memory and livelock. freezing_slow_path will tell the freezer + * that TIF_MEMDIE tasks should be ignored. + */ __thaw_task(tsk); } Better? -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 93+ messages in thread
* [PATCH -v2 3/5] PM: convert printk to pr_* equivalent 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 2014-12-05 16:41 ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko 2014-12-05 16:41 ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko @ 2014-12-05 16:41 ` Michal Hocko 2014-12-05 22:40 ` Rafael J. Wysocki 2014-12-06 13:08 ` Tejun Heo 2014-12-05 16:41 ` [PATCH -v2 4/5] sysrq: " Michal Hocko ` (2 subsequent siblings) 5 siblings, 2 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm While touching this area let's convert printk to pr_*. This also makes the printing of continuation lines done properly. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- kernel/power/process.c | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/kernel/power/process.c b/kernel/power/process.c index 5a6ec8678b9a..3ac45f192e9f 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only) elapsed_msecs = elapsed_msecs64; if (todo) { - printk("\n"); - printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds " + pr_cont("\n"); + pr_err("Freezing of tasks %s after %d.%03d seconds " "(%d tasks refusing to freeze, wq_busy=%d):\n", wakeup ? "aborted" : "failed", elapsed_msecs / 1000, elapsed_msecs % 1000, @@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only) read_unlock(&tasklist_lock); } } else { - printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, + pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, elapsed_msecs % 1000); } @@ -155,7 +155,7 @@ int freeze_processes(void) atomic_inc(&system_freezing_cnt); pm_wakeup_clear(); - printk("Freezing user space processes ... "); + pr_info("Freezing user space processes ... "); pm_freezing = true; oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); @@ -171,13 +171,13 @@ int freeze_processes(void) if (oom_kills_count() != oom_kills_saved && !check_frozen_processes()) { __usermodehelper_set_disable_depth(UMH_ENABLED); - printk("OOM in progress."); + pr_cont("OOM in progress."); error = -EBUSY; } else { - printk("done."); + pr_cont("done."); } } - printk("\n"); + pr_cont("\n"); BUG_ON(in_atomic()); if (error) @@ -197,13 +197,14 @@ int freeze_kernel_threads(void) { int error; - printk("Freezing remaining freezable tasks ... "); + pr_info("Freezing remaining freezable tasks ... "); + pm_nosig_freezing = true; error = try_to_freeze_tasks(false); if (!error) - printk("done."); + pr_cont("done."); - printk("\n"); + pr_cont("\n"); BUG_ON(in_atomic()); if (error) @@ -224,7 +225,7 @@ void thaw_processes(void) oom_killer_enable(); - printk("Restarting tasks ... "); + pr_info("Restarting tasks ... "); __usermodehelper_set_disable_depth(UMH_FREEZING); thaw_workqueues(); @@ -243,7 +244,7 @@ void thaw_processes(void) usermodehelper_enable(); schedule(); - printk("done.\n"); + pr_cont("done.\n"); trace_suspend_resume(TPS("thaw_processes"), 0, false); } @@ -252,7 +253,7 @@ void thaw_kernel_threads(void) struct task_struct *g, *p; pm_nosig_freezing = false; - printk("Restarting kernel threads ... "); + pr_info("Restarting kernel threads ... "); thaw_workqueues(); @@ -264,5 +265,5 @@ void thaw_kernel_threads(void) read_unlock(&tasklist_lock); schedule(); - printk("done.\n"); + pr_cont("done.\n"); } -- 2.1.3 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 3/5] PM: convert printk to pr_* equivalent 2014-12-05 16:41 ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko @ 2014-12-05 22:40 ` Rafael J. Wysocki 2014-12-07 10:26 ` Michal Hocko 2014-12-06 13:08 ` Tejun Heo 1 sibling, 1 reply; 93+ messages in thread From: Rafael J. Wysocki @ 2014-12-05 22:40 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, Tejun Heo, David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Friday, December 05, 2014 05:41:45 PM Michal Hocko wrote: > While touching this area let's convert printk to pr_*. This also makes > the printing of continuation lines done properly. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> This is fine by me. Please let me know if you want me to take it. Otherwise, please feel free to push it through a different tree. > --- > kernel/power/process.c | 29 +++++++++++++++-------------- > 1 file changed, 15 insertions(+), 14 deletions(-) > > diff --git a/kernel/power/process.c b/kernel/power/process.c > index 5a6ec8678b9a..3ac45f192e9f 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only) > elapsed_msecs = elapsed_msecs64; > > if (todo) { > - printk("\n"); > - printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds " > + pr_cont("\n"); > + pr_err("Freezing of tasks %s after %d.%03d seconds " > "(%d tasks refusing to freeze, wq_busy=%d):\n", > wakeup ? "aborted" : "failed", > elapsed_msecs / 1000, elapsed_msecs % 1000, > @@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only) > read_unlock(&tasklist_lock); > } > } else { > - printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, > + pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, > elapsed_msecs % 1000); > } > > @@ -155,7 +155,7 @@ int freeze_processes(void) > atomic_inc(&system_freezing_cnt); > > pm_wakeup_clear(); > - printk("Freezing user space processes ... "); > + pr_info("Freezing user space processes ... "); > pm_freezing = true; > oom_kills_saved = oom_kills_count(); > error = try_to_freeze_tasks(true); > @@ -171,13 +171,13 @@ int freeze_processes(void) > if (oom_kills_count() != oom_kills_saved && > !check_frozen_processes()) { > __usermodehelper_set_disable_depth(UMH_ENABLED); > - printk("OOM in progress."); > + pr_cont("OOM in progress."); > error = -EBUSY; > } else { > - printk("done."); > + pr_cont("done."); > } > } > - printk("\n"); > + pr_cont("\n"); > BUG_ON(in_atomic()); > > if (error) > @@ -197,13 +197,14 @@ int freeze_kernel_threads(void) > { > int error; > > - printk("Freezing remaining freezable tasks ... "); > + pr_info("Freezing remaining freezable tasks ... "); > + > pm_nosig_freezing = true; > error = try_to_freeze_tasks(false); > if (!error) > - printk("done."); > + pr_cont("done."); > > - printk("\n"); > + pr_cont("\n"); > BUG_ON(in_atomic()); > > if (error) > @@ -224,7 +225,7 @@ void thaw_processes(void) > > oom_killer_enable(); > > - printk("Restarting tasks ... "); > + pr_info("Restarting tasks ... "); > > __usermodehelper_set_disable_depth(UMH_FREEZING); > thaw_workqueues(); > @@ -243,7 +244,7 @@ void thaw_processes(void) > usermodehelper_enable(); > > schedule(); > - printk("done.\n"); > + pr_cont("done.\n"); > trace_suspend_resume(TPS("thaw_processes"), 0, false); > } > > @@ -252,7 +253,7 @@ void thaw_kernel_threads(void) > struct task_struct *g, *p; > > pm_nosig_freezing = false; > - printk("Restarting kernel threads ... "); > + pr_info("Restarting kernel threads ... "); > > thaw_workqueues(); > > @@ -264,5 +265,5 @@ void thaw_kernel_threads(void) > read_unlock(&tasklist_lock); > > schedule(); > - printk("done.\n"); > + pr_cont("done.\n"); > } > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 3/5] PM: convert printk to pr_* equivalent 2014-12-05 22:40 ` Rafael J. Wysocki @ 2014-12-07 10:26 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-07 10:26 UTC (permalink / raw) To: Rafael J. Wysocki Cc: linux-mm, Andrew Morton, Tejun Heo, David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Fri 05-12-14 23:40:55, Rafael J. Wysocki wrote: > On Friday, December 05, 2014 05:41:45 PM Michal Hocko wrote: > > While touching this area let's convert printk to pr_*. This also makes > > the printing of continuation lines done properly. > > > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > > This is fine by me. > > Please let me know if you want me to take it. Otherwise, please feel free to > push it through a different tree. I guess it will be easier to push this through Andrew's tree due to other dependencies. > > --- > > kernel/power/process.c | 29 +++++++++++++++-------------- > > 1 file changed, 15 insertions(+), 14 deletions(-) > > > > diff --git a/kernel/power/process.c b/kernel/power/process.c > > index 5a6ec8678b9a..3ac45f192e9f 100644 > > --- a/kernel/power/process.c > > +++ b/kernel/power/process.c > > @@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only) > > elapsed_msecs = elapsed_msecs64; > > > > if (todo) { > > - printk("\n"); > > - printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds " > > + pr_cont("\n"); > > + pr_err("Freezing of tasks %s after %d.%03d seconds " > > "(%d tasks refusing to freeze, wq_busy=%d):\n", > > wakeup ? "aborted" : "failed", > > elapsed_msecs / 1000, elapsed_msecs % 1000, > > @@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only) > > read_unlock(&tasklist_lock); > > } > > } else { > > - printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, > > + pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000, > > elapsed_msecs % 1000); > > } > > > > @@ -155,7 +155,7 @@ int freeze_processes(void) > > atomic_inc(&system_freezing_cnt); > > > > pm_wakeup_clear(); > > - printk("Freezing user space processes ... "); > > + pr_info("Freezing user space processes ... "); > > pm_freezing = true; > > oom_kills_saved = oom_kills_count(); > > error = try_to_freeze_tasks(true); > > @@ -171,13 +171,13 @@ int freeze_processes(void) > > if (oom_kills_count() != oom_kills_saved && > > !check_frozen_processes()) { > > __usermodehelper_set_disable_depth(UMH_ENABLED); > > - printk("OOM in progress."); > > + pr_cont("OOM in progress."); > > error = -EBUSY; > > } else { > > - printk("done."); > > + pr_cont("done."); > > } > > } > > - printk("\n"); > > + pr_cont("\n"); > > BUG_ON(in_atomic()); > > > > if (error) > > @@ -197,13 +197,14 @@ int freeze_kernel_threads(void) > > { > > int error; > > > > - printk("Freezing remaining freezable tasks ... "); > > + pr_info("Freezing remaining freezable tasks ... "); > > + > > pm_nosig_freezing = true; > > error = try_to_freeze_tasks(false); > > if (!error) > > - printk("done."); > > + pr_cont("done."); > > > > - printk("\n"); > > + pr_cont("\n"); > > BUG_ON(in_atomic()); > > > > if (error) > > @@ -224,7 +225,7 @@ void thaw_processes(void) > > > > oom_killer_enable(); > > > > - printk("Restarting tasks ... "); > > + pr_info("Restarting tasks ... "); > > > > __usermodehelper_set_disable_depth(UMH_FREEZING); > > thaw_workqueues(); > > @@ -243,7 +244,7 @@ void thaw_processes(void) > > usermodehelper_enable(); > > > > schedule(); > > - printk("done.\n"); > > + pr_cont("done.\n"); > > trace_suspend_resume(TPS("thaw_processes"), 0, false); > > } > > > > @@ -252,7 +253,7 @@ void thaw_kernel_threads(void) > > struct task_struct *g, *p; > > > > pm_nosig_freezing = false; > > - printk("Restarting kernel threads ... "); > > + pr_info("Restarting kernel threads ... "); > > > > thaw_workqueues(); > > > > @@ -264,5 +265,5 @@ void thaw_kernel_threads(void) > > read_unlock(&tasklist_lock); > > > > schedule(); > > - printk("done.\n"); > > + pr_cont("done.\n"); > > } > > > > -- > I speak only for myself. > Rafael J. Wysocki, Intel Open Source Technology Center. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 3/5] PM: convert printk to pr_* equivalent 2014-12-05 16:41 ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko 2014-12-05 22:40 ` Rafael J. Wysocki @ 2014-12-06 13:08 ` Tejun Heo 1 sibling, 0 replies; 93+ messages in thread From: Tejun Heo @ 2014-12-06 13:08 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Fri, Dec 05, 2014 at 05:41:45PM +0100, Michal Hocko wrote: > While touching this area let's convert printk to pr_*. This also makes > the printing of continuation lines done properly. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH -v2 4/5] sysrq: convert printk to pr_* equivalent 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko ` (2 preceding siblings ...) 2014-12-05 16:41 ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko @ 2014-12-05 16:41 ` Michal Hocko 2014-12-06 13:09 ` Tejun Heo 2014-12-05 16:41 ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 2014-12-07 10:09 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 5 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm While touching this area let's convert printk to pr_*. This also makes the printing of continuation lines done properly. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- drivers/tty/sysrq.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 42bad18c66c9..0071469ecbf1 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -90,7 +90,7 @@ static void sysrq_handle_loglevel(int key) i = key - '0'; console_loglevel = CONSOLE_LOGLEVEL_DEFAULT; - printk("Loglevel set to %d\n", i); + pr_info("Loglevel set to %d\n", i); console_loglevel = i; } static struct sysrq_key_op sysrq_loglevel_op = { @@ -220,7 +220,7 @@ static void showacpu(void *dummy) return; spin_lock_irqsave(&show_lock, flags); - printk(KERN_INFO "CPU%d:\n", smp_processor_id()); + pr_info("CPU%d:\n", smp_processor_id()); show_stack(NULL, NULL); spin_unlock_irqrestore(&show_lock, flags); } @@ -243,7 +243,7 @@ static void sysrq_handle_showallcpus(int key) struct pt_regs *regs = get_irq_regs(); if (regs) { - printk(KERN_INFO "CPU%d:\n", smp_processor_id()); + pr_info("CPU%d:\n", smp_processor_id()); show_regs(regs); } schedule_work(&sysrq_showallcpus); @@ -522,7 +522,7 @@ void __handle_sysrq(int key, bool check_mask) */ orig_log_level = console_loglevel; console_loglevel = CONSOLE_LOGLEVEL_DEFAULT; - printk(KERN_INFO "SysRq : "); + pr_info("SysRq : "); op_p = __sysrq_get_key_op(key); if (op_p) { @@ -531,14 +531,14 @@ void __handle_sysrq(int key, bool check_mask) * should not) and is the invoked operation enabled? */ if (!check_mask || sysrq_on_mask(op_p->enable_mask)) { - printk("%s\n", op_p->action_msg); + pr_cont("%s\n", op_p->action_msg); console_loglevel = orig_log_level; op_p->handler(key); } else { - printk("This sysrq operation is disabled.\n"); + pr_cont("This sysrq operation is disabled.\n"); } } else { - printk("HELP : "); + pr_cont("HELP : "); /* Only print the help msg once per handler */ for (i = 0; i < ARRAY_SIZE(sysrq_key_table); i++) { if (sysrq_key_table[i]) { @@ -549,10 +549,10 @@ void __handle_sysrq(int key, bool check_mask) ; if (j != i) continue; - printk("%s ", sysrq_key_table[i]->help_msg); + pr_cont("%s ", sysrq_key_table[i]->help_msg); } } - printk("\n"); + pr_cont("\n"); console_loglevel = orig_log_level; } rcu_read_unlock(); -- 2.1.3 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 4/5] sysrq: convert printk to pr_* equivalent 2014-12-05 16:41 ` [PATCH -v2 4/5] sysrq: " Michal Hocko @ 2014-12-06 13:09 ` Tejun Heo 0 siblings, 0 replies; 93+ messages in thread From: Tejun Heo @ 2014-12-06 13:09 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Fri, Dec 05, 2014 at 05:41:46PM +0100, Michal Hocko wrote: > While touching this area let's convert printk to pr_*. This also makes > the printing of continuation lines done properly. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko ` (3 preceding siblings ...) 2014-12-05 16:41 ` [PATCH -v2 4/5] sysrq: " Michal Hocko @ 2014-12-05 16:41 ` Michal Hocko 2014-12-06 13:11 ` Tejun Heo ` (2 more replies) 2014-12-07 10:09 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 5 siblings, 3 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via {un}mark_tsk_oom_victim. The last victim wakes up all waiters on oom_victims_wait waitqueue enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. As oom_killer_disable() is a full OOM barrier now we can postpone it in the PM freezer to later after all freezable user tasks are considered frozen (to freeze_kernel_threads). Normally there wouldn't be any unfrozen user tasks at this moment so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet TODO: Android lowmemory killer abuses TIF_MEMDIE in lowmem_scan and it has to learn about oom_disable logic as well. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- drivers/tty/sysrq.c | 5 +- include/linux/oom.h | 14 ++---- kernel/exit.c | 3 +- kernel/power/process.c | 58 ++++++---------------- mm/memcontrol.c | 2 +- mm/oom_kill.c | 131 ++++++++++++++++++++++++++++++++++++++++++------- mm/page_alloc.c | 17 +------ 7 files changed, 137 insertions(+), 93 deletions(-) diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 0071469ecbf1..259a4d5a4e8f 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -355,8 +355,9 @@ static struct sysrq_key_op sysrq_term_op = { static void moom_callback(struct work_struct *ignored) { - out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL, - 0, NULL, true); + if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), + GFP_KERNEL, 0, NULL, true)) + pr_info("OOM request ignored because killer is disabled\n"); } static DECLARE_WORK(moom_work, moom_callback); diff --git a/include/linux/oom.h b/include/linux/oom.h index 1315fcbb9527..03b5c395e514 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -72,22 +72,14 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); extern bool oom_killer_disabled; - -static inline void oom_killer_disable(void) -{ - oom_killer_disabled = true; -} - -static inline void oom_killer_enable(void) -{ - oom_killer_disabled = false; -} +extern bool oom_killer_disable(void); +extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); diff --git a/kernel/exit.c b/kernel/exit.c index ee5176e2a1ba..272915fc603f 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -459,7 +459,8 @@ static void exit_mm(struct task_struct *tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); - unmark_tsk_oom_victim(); + if (test_thread_flag(TIF_MEMDIE)) + unmark_tsk_oom_victim(); } /* diff --git a/kernel/power/process.c b/kernel/power/process.c index 3ac45f192e9f..c3da8b297b10 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) return todo ? -EBUSY : 0; } -static bool __check_frozen_processes(void) -{ - struct task_struct *g, *p; - - for_each_process_thread(g, p) - if (p != current && !freezer_should_skip(p) && !frozen(p)) - return false; - - return true; -} - -/* - * Returns true if all freezable tasks (except for current) are frozen already - */ -static bool check_frozen_processes(void) -{ - bool ret; - - read_lock(&tasklist_lock); - ret = __check_frozen_processes(); - read_unlock(&tasklist_lock); - return ret; -} - /** * freeze_processes - Signal user space processes to enter the refrigerator. * The current thread will not be frozen. The same process that calls @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) int freeze_processes(void) { int error; - int oom_kills_saved; error = __usermodehelper_disable(UMH_FREEZING); if (error) @@ -157,25 +132,10 @@ int freeze_processes(void) pm_wakeup_clear(); pr_info("Freezing user space processes ... "); pm_freezing = true; - oom_kills_saved = oom_kills_count(); error = try_to_freeze_tasks(true); if (!error) { __usermodehelper_set_disable_depth(UMH_DISABLED); - oom_killer_disable(); - - /* - * There might have been an OOM kill while we were - * freezing tasks and the killed task might be still - * on the way out so we have to double check for race. - */ - if (oom_kills_count() != oom_kills_saved && - !check_frozen_processes()) { - __usermodehelper_set_disable_depth(UMH_ENABLED); - pr_cont("OOM in progress."); - error = -EBUSY; - } else { - pr_cont("done."); - } + pr_cont("done."); } pr_cont("\n"); BUG_ON(in_atomic()); @@ -197,8 +157,17 @@ int freeze_kernel_threads(void) { int error; - pr_info("Freezing remaining freezable tasks ... "); + /* + * Now that the whole userspace is frozen we need to disbale + * the OOM killer to disallow any further interference with + * killable tasks. + */ + if (!oom_killer_disable()) { + error = -EBUSY; + goto out; + } + pr_info("Freezing remaining freezable tasks ... "); pm_nosig_freezing = true; error = try_to_freeze_tasks(false); if (!error) @@ -207,6 +176,7 @@ int freeze_kernel_threads(void) pr_cont("\n"); BUG_ON(in_atomic()); +out: if (error) thaw_kernel_threads(); return error; @@ -223,8 +193,6 @@ void thaw_processes(void) pm_freezing = false; pm_nosig_freezing = false; - oom_killer_enable(); - pr_info("Restarting tasks ... "); __usermodehelper_set_disable_depth(UMH_FREEZING); @@ -252,6 +220,8 @@ void thaw_kernel_threads(void) { struct task_struct *g, *p; + oom_killer_enable(); + pm_nosig_freezing = false; pr_info("Restarting kernel threads ... "); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 302e0fc6d121..34a196eb45cd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2155,7 +2155,7 @@ bool mem_cgroup_oom_synchronize(bool handle) if (!memcg) return false; - if (!handle) + if (!handle || oom_killer_disabled) goto cleanup; owait.memcg = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8874058d62db..facc4587daf3 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -405,37 +405,91 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, } /* - * Number of OOM killer invocations (including memcg OOM killer). - * Primarily used by PM freezer to check for potential races with - * OOM killed frozen task. + * Number of OOM victims in flight */ -static atomic_t oom_kills = ATOMIC_INIT(0); +static atomic_t oom_victims = ATOMIC_INIT(0); +static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait); -int oom_kills_count(void) -{ - return atomic_read(&oom_kills); -} - -void note_oom_kill(void) -{ - atomic_inc(&oom_kills); -} +bool oom_killer_disabled __read_mostly; +static DECLARE_RWSEM(oom_sem); /** * Marks the given taks as OOM victim. * @tsk: task to mark + * + * Has to be called with oom_sem taken for read and never after + * oom has been disabled already. */ void mark_tsk_oom_victim(struct task_struct *tsk) { - set_tsk_thread_flag(tsk, TIF_MEMDIE); + WARN_ON(oom_killer_disabled); + /* OOM killer might race with memcg OOM */ + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) + return; + atomic_inc(&oom_victims); } /** * Unmarks the current task as OOM victim. + * + * Wakes up all waiters in oom_killer_disable() */ void unmark_tsk_oom_victim(void) { - clear_thread_flag(TIF_MEMDIE); + if (!test_and_clear_thread_flag(TIF_MEMDIE)) + return; + + down_read(&oom_sem); + /* + * There is no need to signal the lasst oom_victim if there + * is nobody who cares. + */ + if (!atomic_dec_return(&oom_victims) && oom_killer_disabled) + wake_up_all(&oom_victims_wait); + up_read(&oom_sem); +} + +/** + * oom_killer_disable - disable OOM killer + * + * Forces all page allocations to fail rather than trigger OOM killer. + * Will block and wait until all OOM victims are killed. + * + * The function cannot be called when there are runnable user tasks because + * the userspace would see unexpected allocation failures as a result. Any + * new usage of this function should be consulted with MM people. + * + * Returns true if successful and false if the OOM killer cannot be + * disabled. + */ +bool oom_killer_disable(void) +{ + /* + * Make sure to not race with an ongoing OOM killer + * and that the current is not the victim. + */ + down_write(&oom_sem); + if (test_thread_flag(TIF_MEMDIE)) { + up_write(&oom_sem); + return false; + } + + oom_killer_disabled = true; + up_write(&oom_sem); + + wait_event(oom_victims_wait, atomic_read(&oom_victims)); + + return true; +} + +/** + * oom_killer_enable - enable OOM killer + */ +void oom_killer_enable(void) +{ + down_write(&oom_sem); + oom_killer_disabled = false; + up_write(&oom_sem); } #define K(x) ((x) << (PAGE_SHIFT-10)) @@ -635,7 +689,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) } /** - * out_of_memory - kill the "best" process when we run out of memory + * __out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 @@ -647,7 +701,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask) * OR try to be smart about which process to kill. Note that we * don't have to be perfect here, we just have to be good. */ -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask, bool force_kill) { const nodemask_t *mpol_mask; @@ -712,6 +766,32 @@ out: schedule_timeout_killable(1); } +/** + * out_of_memory - tries to invoke OOM killer. + * @zonelist: zonelist pointer + * @gfp_mask: memory allocation flags + * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator + * @force_kill: true if a task must be killed, even if others are exiting + * + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable() + * when it returns false. Otherwise returns true. + */ +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, + int order, nodemask_t *nodemask, bool force_kill) +{ + bool ret = false; + + down_read(&oom_sem); + if (!oom_killer_disabled) { + __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill); + ret = true; + } + up_read(&oom_sem); + + return ret; +} + /* * The pagefault handler calls here because it is out of memory, so kill a * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a @@ -721,12 +801,25 @@ void pagefault_out_of_memory(void) { struct zonelist *zonelist; + down_read(&oom_sem); if (mem_cgroup_oom_synchronize(true)) - return; + goto unlock; zonelist = node_zonelist(first_memory_node, GFP_KERNEL); if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) { - out_of_memory(NULL, 0, 0, NULL, false); + if (!oom_killer_disabled) + __out_of_memory(NULL, 0, 0, NULL, false); + else + /* + * There shouldn't be any user tasks runable while the + * OOM killer is disabled so the current task has to + * be a racing OOM victim for which oom_killer_disable() + * is waiting for. + */ + WARN_ON(test_thread_flag(TIF_MEMDIE)); + oom_zonelist_unlock(zonelist, GFP_KERNEL); } +unlock: + up_read(&oom_sem); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 721780ce1fd3..5b87346837dd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype) PB_migrate, PB_migrate_end); } -bool oom_killer_disabled __read_mostly; - #ifdef CONFIG_DEBUG_VM static int page_outside_zone_boundaries(struct zone *zone, struct page *page) { @@ -2247,9 +2245,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, *did_some_progress = 0; - if (oom_killer_disabled) - return NULL; - /* * Acquire the per-zone oom lock for each zone. If that * fails, somebody else is making progress for us. @@ -2261,14 +2256,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* - * PM-freezer should be notified that there might be an OOM killer on - * its way to kill and wake somebody up. This is too early and we might - * end up not killing anything but false positives are acceptable. - * See freeze_processes. - */ - note_oom_kill(); - - /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if * we're still under heavy pressure. @@ -2304,8 +2291,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; } /* Exhausted what can be done so it's blamo time */ - out_of_memory(zonelist, gfp_mask, order, nodemask, false); - *did_some_progress = 1; + if (out_of_memory(zonelist, gfp_mask, order, nodemask, false)) + *did_some_progress = 1; out: oom_zonelist_unlock(zonelist, gfp_mask); return page; -- 2.1.3 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless 2014-12-05 16:41 ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko @ 2014-12-06 13:11 ` Tejun Heo 2014-12-07 10:11 ` Michal Hocko 2015-01-07 18:41 ` Tejun Heo 2015-01-08 11:51 ` Michal Hocko 2 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-12-06 13:11 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote: > 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) > has left a race window when OOM killer manages to note_oom_kill after > freeze_processes checks the counter. The race window is quite small and > really unlikely and partial solution deemed sufficient at the time of > submission. This patch doesn't apply on top of v3.18-rc3, latest mainline, -mm or -next. Did I miss something? Can you please check the patch? Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless 2014-12-06 13:11 ` Tejun Heo @ 2014-12-07 10:11 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-07 10:11 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sat 06-12-14 08:11:15, Tejun Heo wrote: > On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote: > > 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) > > has left a race window when OOM killer manages to note_oom_kill after > > freeze_processes checks the counter. The race window is quite small and > > really unlikely and partial solution deemed sufficient at the time of > > submission. > > This patch doesn't apply on top of v3.18-rc3, latest mainline, -mm or > -next. Did I miss something? Can you please check the patch? The original cover letter which didn't make it to the mailing list has mentioned that. I have reposted it now. Anyway this is on top of http://marc.info/?l=linux-kernel&m=141779091114777 which hasn't landed into -mm tree at the time I was posting this. Sorry about the confusion. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless 2014-12-05 16:41 ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 2014-12-06 13:11 ` Tejun Heo @ 2015-01-07 18:41 ` Tejun Heo 2015-01-07 18:48 ` Michal Hocko 2015-01-08 11:51 ` Michal Hocko 2 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2015-01-07 18:41 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm Hello, Michal. Sorry about the long delay. On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote: ... > @@ -252,6 +220,8 @@ void thaw_kernel_threads(void) > { > struct task_struct *g, *p; > > + oom_killer_enable(); > + Wouldn't it be more symmetrical and make more sense to enable oom killer after kernel threads are thawed? Until kernel threads are thawed, it isn't guaranteed that oom killer would be able to make forward progress, right? Other than that, looks good to me. Thanks! -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless 2015-01-07 18:41 ` Tejun Heo @ 2015-01-07 18:48 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2015-01-07 18:48 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Wed 07-01-15 13:41:24, Tejun Heo wrote: > Hello, Michal. Sorry about the long delay. > > On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote: > ... > > @@ -252,6 +220,8 @@ void thaw_kernel_threads(void) > > { > > struct task_struct *g, *p; > > > > + oom_killer_enable(); > > + > > Wouldn't it be more symmetrical and make more sense to enable oom > killer after kernel threads are thawed? Until kernel threads are > thawed, it isn't guaranteed that oom killer would be able to make > forward progress, right? Makes sense, fixed. > Other than that, looks good to me. Thanks! Btw. I plan to repost after Andrew releases new mmotm as there are some dependencies in oom area. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless 2014-12-05 16:41 ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 2014-12-06 13:11 ` Tejun Heo 2015-01-07 18:41 ` Tejun Heo @ 2015-01-08 11:51 ` Michal Hocko 2 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2015-01-08 11:51 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Fri 05-12-14 17:41:47, Michal Hocko wrote: [...] > +bool oom_killer_disable(void) > +{ > + /* > + * Make sure to not race with an ongoing OOM killer > + * and that the current is not the victim. > + */ > + down_write(&oom_sem); > + if (test_thread_flag(TIF_MEMDIE)) { > + up_write(&oom_sem); > + return false; > + } > + > + oom_killer_disabled = true; > + up_write(&oom_sem); > + > + wait_event(oom_victims_wait, atomic_read(&oom_victims)); Ups brainfart... Should be !atomic_read(&oom_victims). Condition says for what we are waiting not when we are waiting. > + > + return true; > +} [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/4] OOM vs PM freezer fixes 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko ` (4 preceding siblings ...) 2014-12-05 16:41 ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko @ 2014-12-07 10:09 ` Michal Hocko 2014-12-07 13:55 ` Tejun Heo 5 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-07 10:09 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm For some reason this is the previous version of the cover letter. I had some issues with git send-email which was failing for me. Anyway, this is the correct cover. Sorry about the cofusion. Hi, this is another attempt to address OOM vs. PM interaction. More about the issue is described in the last patch. The other 4 patches are just clean ups. This is based on top of 3.18-rc3 + Johannes' http://marc.info/?l=linux-kernel&m=141779091114777 which is not in the Andrew's tree yet but I wanted to prevent from later merge conflicts. The previous version of the main patch (5th one) was posted here: http://marc.info/?l=linux-mm&m=141634503316543&w=2. This version has hopefully addressed all the points raised by Tejun in the previous version. Namely - checkpatch fixes + printk -> pr_* changes in the respective areas - more comments added to clarify subtle interactions - oom_killer_disable(), unmark_tsk_oom_victim changed into wait_even API which is easier to use Both OOM killer and the PM freezer are really subtle so I would really appreciate a throughout review here. I still haven't changed lowmemory killer which is abusing TIF_MEMDIE yet and it would break this code (oom_victims counter balance) and I plan to look at it as soon as the rest of the of the series is OK and agreed as a way to go. So there will be at least one more patch for the final submission. Thanks! Michal Hocko (5): oom: add helpers for setting and clearing TIF_MEMDIE OOM: thaw the OOM victim if it is frozen PM: convert printk to pr_* equivalent sysrq: convert printk to pr_* equivalent OOM, PM: make OOM detection in the freezer path raceless And diffstat: drivers/tty/sysrq.c | 23 ++++---- include/linux/oom.h | 18 +++---- kernel/exit.c | 3 +- kernel/power/process.c | 81 +++++++++------------------- mm/memcontrol.c | 4 +- mm/oom_kill.c | 142 +++++++++++++++++++++++++++++++++++++++++++------ mm/page_alloc.c | 17 +----- 7 files changed, 178 insertions(+), 110 deletions(-) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/4] OOM vs PM freezer fixes 2014-12-07 10:09 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko @ 2014-12-07 13:55 ` Tejun Heo 2014-12-07 19:00 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Tejun Heo @ 2014-12-07 13:55 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote: > this is another attempt to address OOM vs. PM interaction. More > about the issue is described in the last patch. The other 4 patches > are just clean ups. This is based on top of 3.18-rc3 + Johannes' > http://marc.info/?l=linux-kernel&m=141779091114777 which is not in the > Andrew's tree yet but I wanted to prevent from later merge conflicts. When the patches are based on a custom tree, it's often a good idea to create a git branch of the patches to help reviewing. Thanks. -- tejun ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/4] OOM vs PM freezer fixes 2014-12-07 13:55 ` Tejun Heo @ 2014-12-07 19:00 ` Michal Hocko 2014-12-18 16:27 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-12-07 19:00 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sun 07-12-14 08:55:51, Tejun Heo wrote: > On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote: > > this is another attempt to address OOM vs. PM interaction. More > > about the issue is described in the last patch. The other 4 patches > > are just clean ups. This is based on top of 3.18-rc3 + Johannes' > > http://marc.info/?l=linux-kernel&m=141779091114777 which is not in the > > Andrew's tree yet but I wanted to prevent from later merge conflicts. > > When the patches are based on a custom tree, it's often a good idea to > create a git branch of the patches to help reviewing. git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to-review/make-oom-vs-pm-freezing-more-robust-2 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 0/4] OOM vs PM freezer fixes 2014-12-07 19:00 ` Michal Hocko @ 2014-12-18 16:27 ` Michal Hocko 0 siblings, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-12-18 16:27 UTC (permalink / raw) To: Tejun Heo Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\", David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm On Sun 07-12-14 20:00:26, Michal Hocko wrote: > On Sun 07-12-14 08:55:51, Tejun Heo wrote: > > On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote: > > > this is another attempt to address OOM vs. PM interaction. More > > > about the issue is described in the last patch. The other 4 patches > > > are just clean ups. This is based on top of 3.18-rc3 + Johannes' > > > http://marc.info/?l=linux-kernel&m=141779091114777 which is not in the > > > Andrew's tree yet but I wanted to prevent from later merge conflicts. > > > > When the patches are based on a custom tree, it's often a good idea to > > create a git branch of the patches to help reviewing. > > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to-review/make-oom-vs-pm-freezing-more-robust-2 Are there any other concerns? Should I just resubmit (after rc1)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-05 12:46 ` Michal Hocko 2014-11-05 13:02 ` Tejun Heo @ 2014-11-05 14:55 ` Michal Hocko 1 sibling, 0 replies; 93+ messages in thread From: Michal Hocko @ 2014-11-05 14:55 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes, Oleg Nesterov, LKML, linux-mm, Linux PM list On Wed 05-11-14 13:46:20, Michal Hocko wrote: [...] > From ef6227565fa65b52986c4626d49ba53b499e54d1 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Wed, 5 Nov 2014 11:49:14 +0100 > Subject: [PATCH] OOM, PM: make OOM detection in the freezer path raceless > > 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend) > has left a race window when OOM killer manages to note_oom_kill after > freeze_processes checks the counter. The race window is quite small > and really unlikely and deemed sufficient at the time of submission. > > Tejun wasn't happy about this partial solution though and insisted on > a full solution. That requires the full OOM and freezer exclusion, > though. This is done by this patch which introduces oom_sem RW lock. > Page allocation OOM path takes the lock for reading because there might > be concurrent OOM happening on disjunct zonelists. oom_killer_disabled > check is moved right before out_of_memory is called because it was > checked too early before and we do not want to hold the lock while doing > the last attempt for allocation which might involve zone_reclaim. This is incorrect because it would cause an endless allocation loop because we really have to got to no_page if OOM is disabled. > freeze_processes then takes the lock for write throughout the whole > freezing process and OOM disabling. > > There is no need to recheck all the processes with the full > synchronization anymore. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > include/linux/oom.h | 5 +++++ > kernel/power/process.c | 50 +++++++++----------------------------------------- > mm/oom_kill.c | 17 ----------------- > mm/page_alloc.c | 24 ++++++++++++------------ > 4 files changed, 26 insertions(+), 70 deletions(-) > > diff --git a/include/linux/oom.h b/include/linux/oom.h > index e8d6e1058723..350b9b2ffeec 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -73,7 +73,12 @@ extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > extern int register_oom_notifier(struct notifier_block *nb); > extern int unregister_oom_notifier(struct notifier_block *nb); > > +/* > + * oom_killer_disabled can be modified only under oom_sem taken for write > + * and checked under read lock along with the full OOM handler. > + */ > extern bool oom_killer_disabled; > +extern struct rw_semaphore oom_sem; > > static inline void oom_killer_disable(void) > { > diff --git a/kernel/power/process.c b/kernel/power/process.c > index 5a6ec8678b9a..befce9785233 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only) > return todo ? -EBUSY : 0; > } > > -static bool __check_frozen_processes(void) > -{ > - struct task_struct *g, *p; > - > - for_each_process_thread(g, p) > - if (p != current && !freezer_should_skip(p) && !frozen(p)) > - return false; > - > - return true; > -} > - > -/* > - * Returns true if all freezable tasks (except for current) are frozen already > - */ > -static bool check_frozen_processes(void) > -{ > - bool ret; > - > - read_lock(&tasklist_lock); > - ret = __check_frozen_processes(); > - read_unlock(&tasklist_lock); > - return ret; > -} > - > /** > * freeze_processes - Signal user space processes to enter the refrigerator. > * The current thread will not be frozen. The same process that calls > @@ -142,7 +118,6 @@ static bool check_frozen_processes(void) > int freeze_processes(void) > { > int error; > - int oom_kills_saved; > > error = __usermodehelper_disable(UMH_FREEZING); > if (error) > @@ -157,27 +132,20 @@ int freeze_processes(void) > pm_wakeup_clear(); > printk("Freezing user space processes ... "); > pm_freezing = true; > - oom_kills_saved = oom_kills_count(); > + > + /* > + * Need to exlude OOM killer from triggering while tasks are > + * getting frozen to make sure none of them gets killed after > + * try_to_freeze_tasks is done. > + */ > + down_write(&oom_sem); > error = try_to_freeze_tasks(true); > if (!error) { > __usermodehelper_set_disable_depth(UMH_DISABLED); > oom_killer_disable(); > - > - /* > - * There might have been an OOM kill while we were > - * freezing tasks and the killed task might be still > - * on the way out so we have to double check for race. > - */ > - if (oom_kills_count() != oom_kills_saved && > - !check_frozen_processes()) { > - __usermodehelper_set_disable_depth(UMH_ENABLED); > - printk("OOM in progress."); > - error = -EBUSY; > - } else { > - printk("done."); > - } > + printk("done.\n"); > } > - printk("\n"); > + up_write(&oom_sem); > BUG_ON(in_atomic()); > > if (error) > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 5340f6b91312..bbf405a3a18f 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > dump_tasks(memcg, nodemask); > } > > -/* > - * Number of OOM killer invocations (including memcg OOM killer). > - * Primarily used by PM freezer to check for potential races with > - * OOM killed frozen task. > - */ > -static atomic_t oom_kills = ATOMIC_INIT(0); > - > -int oom_kills_count(void) > -{ > - return atomic_read(&oom_kills); > -} > - > -void note_oom_kill(void) > -{ > - atomic_inc(&oom_kills); > -} > - > #define K(x) ((x) << (PAGE_SHIFT-10)) > /* > * Must be called while holding a reference to p, which will be released upon > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 9cd36b822444..76095266c4b5 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -243,6 +243,7 @@ void set_pageblock_migratetype(struct page *page, int migratetype) > } > > bool oom_killer_disabled __read_mostly; > +DECLARE_RWSEM(oom_sem); > > #ifdef CONFIG_DEBUG_VM > static int page_outside_zone_boundaries(struct zone *zone, struct page *page) > @@ -2252,14 +2253,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > } > > /* > - * PM-freezer should be notified that there might be an OOM killer on > - * its way to kill and wake somebody up. This is too early and we might > - * end up not killing anything but false positives are acceptable. > - * See freeze_processes. > - */ > - note_oom_kill(); > - > - /* > * Go through the zonelist yet one more time, keep very high watermark > * here, this is only to catch a parallel oom killing, we must fail if > * we're still under heavy pressure. > @@ -2288,8 +2281,17 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > if (gfp_mask & __GFP_THISNODE) > goto out; > } > - /* Exhausted what can be done so it's blamo time */ > - out_of_memory(zonelist, gfp_mask, order, nodemask, false); > + > + /* > + * Exhausted what can be done so it's blamo time. > + * Just make sure that we cannot race with oom_killer disabling > + * e.g. PM freezer needs to make sure that no OOM happens after > + * all tasks are frozen. > + */ > + down_read(&oom_sem); > + if (!oom_killer_disabled) > + out_of_memory(zonelist, gfp_mask, order, nodemask, false); > + up_read(&oom_sem); > > out: > oom_zonelist_unlock(zonelist, gfp_mask); > @@ -2716,8 +2718,6 @@ rebalance: > */ > if (!did_some_progress) { > if (oom_gfp_allowed(gfp_mask)) { > - if (oom_killer_disabled) > - goto nopage; > /* Coredumps can quickly deplete all memory reserves */ > if ((current->flags & PF_DUMPCORE) && > !(gfp_mask & __GFP_NOFAIL)) > -- > 2.1.1 > > > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend 2014-10-21 7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko 2014-10-21 12:09 ` Rafael J. Wysocki @ 2014-10-26 18:40 ` Pavel Machek 1 sibling, 0 replies; 93+ messages in thread From: Pavel Machek @ 2014-10-26 18:40 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, \"Rafael J. Wysocki\", Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list Hi! > + > + /* > + * There might have been an OOM kill while we were > + * freezing tasks and the killed task might be still > + * on the way out so we have to double check for race. > + */ ", so" > /* > + * PM-freezer should be notified that there might be an OOM killer on its > + * way to kill and wake somebody up. This is too early and we might end > + * up not killing anything but false positives are acceptable. ", but". 1,2 look good to me, Acked-by: Pavel Machek <pavel@ucw.cz> Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 93+ messages in thread
* [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread 2014-10-21 7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko ` (2 preceding siblings ...) 2014-10-21 7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko @ 2014-10-21 7:27 ` Michal Hocko 2014-10-21 12:10 ` Rafael J. Wysocki 3 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-10-21 7:27 UTC (permalink / raw) To: Andrew Morton, \"Rafael J. Wysocki\" Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list as per 0c740d0afc3b (introduce for_each_thread() to replace the buggy while_each_thread()) get rid of do_each_thread { } while_each_thread() construct and replace it by a more error prone for_each_thread. This patch doesn't introduce any user visible change. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- kernel/power/process.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/kernel/power/process.c b/kernel/power/process.c index a397fa161d11..7fd7b72554fe 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -46,13 +46,13 @@ static int try_to_freeze_tasks(bool user_only) while (true) { todo = 0; read_lock(&tasklist_lock); - do_each_thread(g, p) { + for_each_process_thread(g, p) { if (p == current || !freeze_task(p)) continue; if (!freezer_should_skip(p)) todo++; - } while_each_thread(g, p); + } read_unlock(&tasklist_lock); if (!user_only) { @@ -93,11 +93,11 @@ static int try_to_freeze_tasks(bool user_only) if (!wakeup) { read_lock(&tasklist_lock); - do_each_thread(g, p) { + for_each_process_thread(g, p) { if (p != current && !freezer_should_skip(p) && freezing(p) && !frozen(p)) sched_show_task(p); - } while_each_thread(g, p); + } read_unlock(&tasklist_lock); } } else { @@ -219,11 +219,11 @@ void thaw_processes(void) thaw_workqueues(); read_lock(&tasklist_lock); - do_each_thread(g, p) { + for_each_process_thread(g, p) { /* No other threads should have PF_SUSPEND_TASK set */ WARN_ON((p != curr) && (p->flags & PF_SUSPEND_TASK)); __thaw_task(p); - } while_each_thread(g, p); + } read_unlock(&tasklist_lock); WARN_ON(!(curr->flags & PF_SUSPEND_TASK)); @@ -246,10 +246,10 @@ void thaw_kernel_threads(void) thaw_workqueues(); read_lock(&tasklist_lock); - do_each_thread(g, p) { + for_each_process_thread(g, p) { if (p->flags & (PF_KTHREAD | PF_WQ_WORKER)) __thaw_task(p); - } while_each_thread(g, p); + } read_unlock(&tasklist_lock); schedule(); -- 2.1.1 ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread 2014-10-21 7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko @ 2014-10-21 12:10 ` Rafael J. Wysocki 2014-10-21 13:19 ` Michal Hocko 0 siblings, 1 reply; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-21 12:10 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 09:27:15 AM Michal Hocko wrote: > as per 0c740d0afc3b (introduce for_each_thread() to replace the buggy > while_each_thread()) get rid of do_each_thread { } while_each_thread() > construct and replace it by a more error prone for_each_thread. > > This patch doesn't introduce any user visible change. > > Suggested-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.cz> ACK Or do you want me to handle this series? > --- > kernel/power/process.c | 16 ++++++++-------- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/kernel/power/process.c b/kernel/power/process.c > index a397fa161d11..7fd7b72554fe 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -46,13 +46,13 @@ static int try_to_freeze_tasks(bool user_only) > while (true) { > todo = 0; > read_lock(&tasklist_lock); > - do_each_thread(g, p) { > + for_each_process_thread(g, p) { > if (p == current || !freeze_task(p)) > continue; > > if (!freezer_should_skip(p)) > todo++; > - } while_each_thread(g, p); > + } > read_unlock(&tasklist_lock); > > if (!user_only) { > @@ -93,11 +93,11 @@ static int try_to_freeze_tasks(bool user_only) > > if (!wakeup) { > read_lock(&tasklist_lock); > - do_each_thread(g, p) { > + for_each_process_thread(g, p) { > if (p != current && !freezer_should_skip(p) > && freezing(p) && !frozen(p)) > sched_show_task(p); > - } while_each_thread(g, p); > + } > read_unlock(&tasklist_lock); > } > } else { > @@ -219,11 +219,11 @@ void thaw_processes(void) > thaw_workqueues(); > > read_lock(&tasklist_lock); > - do_each_thread(g, p) { > + for_each_process_thread(g, p) { > /* No other threads should have PF_SUSPEND_TASK set */ > WARN_ON((p != curr) && (p->flags & PF_SUSPEND_TASK)); > __thaw_task(p); > - } while_each_thread(g, p); > + } > read_unlock(&tasklist_lock); > > WARN_ON(!(curr->flags & PF_SUSPEND_TASK)); > @@ -246,10 +246,10 @@ void thaw_kernel_threads(void) > thaw_workqueues(); > > read_lock(&tasklist_lock); > - do_each_thread(g, p) { > + for_each_process_thread(g, p) { > if (p->flags & (PF_KTHREAD | PF_WQ_WORKER)) > __thaw_task(p); > - } while_each_thread(g, p); > + } > read_unlock(&tasklist_lock); > > schedule(); > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread 2014-10-21 12:10 ` Rafael J. Wysocki @ 2014-10-21 13:19 ` Michal Hocko 2014-10-21 13:43 ` Rafael J. Wysocki 0 siblings, 1 reply; 93+ messages in thread From: Michal Hocko @ 2014-10-21 13:19 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tue 21-10-14 14:10:18, Rafael J. Wysocki wrote: > On Tuesday, October 21, 2014 09:27:15 AM Michal Hocko wrote: > > as per 0c740d0afc3b (introduce for_each_thread() to replace the buggy > > while_each_thread()) get rid of do_each_thread { } while_each_thread() > > construct and replace it by a more error prone for_each_thread. > > > > This patch doesn't introduce any user visible change. > > > > Suggested-by: Oleg Nesterov <oleg@redhat.com> > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > > ACK > > Or do you want me to handle this series? I don't know, I hoped either you or Andrew to pick it up. > > --- > > kernel/power/process.c | 16 ++++++++-------- > > 1 file changed, 8 insertions(+), 8 deletions(-) > > > > diff --git a/kernel/power/process.c b/kernel/power/process.c > > index a397fa161d11..7fd7b72554fe 100644 > > --- a/kernel/power/process.c > > +++ b/kernel/power/process.c > > @@ -46,13 +46,13 @@ static int try_to_freeze_tasks(bool user_only) > > while (true) { > > todo = 0; > > read_lock(&tasklist_lock); > > - do_each_thread(g, p) { > > + for_each_process_thread(g, p) { > > if (p == current || !freeze_task(p)) > > continue; > > > > if (!freezer_should_skip(p)) > > todo++; > > - } while_each_thread(g, p); > > + } > > read_unlock(&tasklist_lock); > > > > if (!user_only) { > > @@ -93,11 +93,11 @@ static int try_to_freeze_tasks(bool user_only) > > > > if (!wakeup) { > > read_lock(&tasklist_lock); > > - do_each_thread(g, p) { > > + for_each_process_thread(g, p) { > > if (p != current && !freezer_should_skip(p) > > && freezing(p) && !frozen(p)) > > sched_show_task(p); > > - } while_each_thread(g, p); > > + } > > read_unlock(&tasklist_lock); > > } > > } else { > > @@ -219,11 +219,11 @@ void thaw_processes(void) > > thaw_workqueues(); > > > > read_lock(&tasklist_lock); > > - do_each_thread(g, p) { > > + for_each_process_thread(g, p) { > > /* No other threads should have PF_SUSPEND_TASK set */ > > WARN_ON((p != curr) && (p->flags & PF_SUSPEND_TASK)); > > __thaw_task(p); > > - } while_each_thread(g, p); > > + } > > read_unlock(&tasklist_lock); > > > > WARN_ON(!(curr->flags & PF_SUSPEND_TASK)); > > @@ -246,10 +246,10 @@ void thaw_kernel_threads(void) > > thaw_workqueues(); > > > > read_lock(&tasklist_lock); > > - do_each_thread(g, p) { > > + for_each_process_thread(g, p) { > > if (p->flags & (PF_KTHREAD | PF_WQ_WORKER)) > > __thaw_task(p); > > - } while_each_thread(g, p); > > + } > > read_unlock(&tasklist_lock); > > > > schedule(); > > > > -- > I speak only for myself. > Rafael J. Wysocki, Intel Open Source Technology Center. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread 2014-10-21 13:19 ` Michal Hocko @ 2014-10-21 13:43 ` Rafael J. Wysocki 0 siblings, 0 replies; 93+ messages in thread From: Rafael J. Wysocki @ 2014-10-21 13:43 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list On Tuesday, October 21, 2014 03:19:53 PM Michal Hocko wrote: > On Tue 21-10-14 14:10:18, Rafael J. Wysocki wrote: > > On Tuesday, October 21, 2014 09:27:15 AM Michal Hocko wrote: > > > as per 0c740d0afc3b (introduce for_each_thread() to replace the buggy > > > while_each_thread()) get rid of do_each_thread { } while_each_thread() > > > construct and replace it by a more error prone for_each_thread. > > > > > > This patch doesn't introduce any user visible change. > > > > > > Suggested-by: Oleg Nesterov <oleg@redhat.com> > > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > > > > ACK > > > > Or do you want me to handle this series? > > I don't know, I hoped either you or Andrew to pick it up. OK, I will then. -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 93+ messages in thread
end of thread, other threads:[~2015-01-08 11:51 UTC | newest] Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-10-21 7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko 2014-10-21 7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko 2014-10-21 12:04 ` Rafael J. Wysocki 2014-10-21 7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko 2014-10-21 12:04 ` Rafael J. Wysocki 2014-10-21 7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko 2014-10-21 12:09 ` Rafael J. Wysocki 2014-10-21 13:14 ` Michal Hocko 2014-10-21 13:42 ` Rafael J. Wysocki 2014-10-21 14:11 ` Michal Hocko 2014-10-21 14:41 ` Rafael J. Wysocki 2014-10-21 14:29 ` Michal Hocko 2014-10-22 14:39 ` Rafael J. Wysocki 2014-10-22 14:22 ` Michal Hocko 2014-10-22 21:18 ` Rafael J. Wysocki 2014-10-26 18:49 ` Pavel Machek 2014-11-04 19:27 ` Tejun Heo 2014-11-05 12:46 ` Michal Hocko 2014-11-05 13:02 ` Tejun Heo 2014-11-05 13:31 ` Michal Hocko 2014-11-05 13:42 ` Michal Hocko 2014-11-05 14:14 ` Michal Hocko 2014-11-05 15:45 ` Michal Hocko 2014-11-05 15:44 ` Tejun Heo 2014-11-05 16:01 ` Michal Hocko 2014-11-05 16:29 ` Tejun Heo 2014-11-05 16:39 ` Michal Hocko 2014-11-05 16:54 ` Tejun Heo 2014-11-05 17:01 ` Tejun Heo 2014-11-06 13:05 ` Michal Hocko 2014-11-06 15:09 ` Tejun Heo 2014-11-06 16:01 ` Michal Hocko 2014-11-06 16:12 ` Tejun Heo 2014-11-06 16:31 ` Michal Hocko 2014-11-06 16:33 ` Tejun Heo 2014-11-06 16:58 ` Michal Hocko 2014-11-05 17:46 ` Michal Hocko 2014-11-05 17:55 ` Tejun Heo 2014-11-06 12:49 ` Michal Hocko 2014-11-06 15:01 ` Tejun Heo 2014-11-06 16:02 ` Michal Hocko 2014-11-06 16:28 ` Tejun Heo 2014-11-10 16:30 ` Michal Hocko 2014-11-12 18:58 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko 2014-11-12 18:58 ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko 2014-11-14 17:55 ` Tejun Heo 2014-11-12 18:58 ` [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 2014-11-12 18:58 ` [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly Michal Hocko 2014-11-12 18:58 ` [RFC 4/4] OOM: thaw the OOM victim if it is frozen Michal Hocko 2014-11-14 20:14 ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo 2014-11-18 21:08 ` Michal Hocko 2014-11-18 21:10 ` [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE Michal Hocko 2014-11-18 21:10 ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 2014-11-27 0:47 ` Rafael J. Wysocki 2014-12-02 22:08 ` Tejun Heo 2014-12-04 14:16 ` Michal Hocko 2014-12-04 14:44 ` Tejun Heo 2014-12-04 16:56 ` Michal Hocko 2014-12-04 17:18 ` Michal Hocko 2014-12-05 16:41 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 2014-12-05 16:41 ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko 2014-12-06 12:56 ` Tejun Heo 2014-12-07 10:13 ` Michal Hocko 2015-01-07 17:57 ` Tejun Heo 2015-01-07 18:23 ` Michal Hocko 2014-12-05 16:41 ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko 2014-12-06 13:06 ` Tejun Heo 2014-12-07 10:24 ` Michal Hocko 2014-12-07 10:45 ` Michal Hocko 2014-12-07 13:59 ` Tejun Heo 2014-12-07 18:55 ` Michal Hocko 2014-12-05 16:41 ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko 2014-12-05 22:40 ` Rafael J. Wysocki 2014-12-07 10:26 ` Michal Hocko 2014-12-06 13:08 ` Tejun Heo 2014-12-05 16:41 ` [PATCH -v2 4/5] sysrq: " Michal Hocko 2014-12-06 13:09 ` Tejun Heo 2014-12-05 16:41 ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko 2014-12-06 13:11 ` Tejun Heo 2014-12-07 10:11 ` Michal Hocko 2015-01-07 18:41 ` Tejun Heo 2015-01-07 18:48 ` Michal Hocko 2015-01-08 11:51 ` Michal Hocko 2014-12-07 10:09 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko 2014-12-07 13:55 ` Tejun Heo 2014-12-07 19:00 ` Michal Hocko 2014-12-18 16:27 ` Michal Hocko 2014-11-05 14:55 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko 2014-10-26 18:40 ` Pavel Machek 2014-10-21 7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko 2014-10-21 12:10 ` Rafael J. Wysocki 2014-10-21 13:19 ` Michal Hocko 2014-10-21 13:43 ` Rafael J. Wysocki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).