linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4 -v2] OOM vs. freezer interaction fixes
@ 2014-10-21  7:27 Michal Hocko
  2014-10-21  7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 93+ messages in thread
From: Michal Hocko @ 2014-10-21  7:27 UTC (permalink / raw)
  To: Andrew Morton, \"Rafael J. Wysocki\"
  Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML,
	linux-mm, Linux PM list

Hi Andrew, Rafael,

this series was originally discussed here [1] and previously posted here [2].
I have updated the patches according to Oleg's feedback.

The first and third patches are regression fixes and stable material
IMO. The second and fourth patches are simple cleanups.

The 1st patch fixes a regression introduced in 3.3, since when the OOM
killer has been unable to kill any frozen task, livelocking as a result.
The fix gets us back to the 3.2 behavior. As it turned out during the
discussion [3], this is still not 100% sufficient, which is why we need
the 3rd patch.

I was thinking about the proper 1st vs. 3rd patch ordering, because the
1st patch basically opens a race window which is considerably reduced by
the later patch. The race is hard to close completely without full
synchronization of the OOM path (including the allocator) and the
freezer, which is not worth the trouble.

The original patch from Cong Wang covered this by checking
cgroup_freezing(current) in the __refrigerator path [4]. But this
approach still suffers from the OOM vs. PM freezer interaction (the OOM
killer would still livelock, this time waiting for a PM-frozen task).

So I think the most straightforward way is to address only the OOM vs.
frozen task interaction in the first patch, mark it for stable 3.3+, and
leave the race to a separate follow-up patch which is applicable to
stable 3.2+ (before a3201227f803 made it inefficient).

Switching the 1st and 3rd patches would make some sense as well, but
that might end up even more confusing because we would be fixing a
non-existent upstream issue first...

Cong Wang (2):
      freezer: Do not freeze tasks killed by OOM killer
      freezer: remove obsolete comments in __thaw_task()

Michal Hocko (2):
      OOM, PM: OOM killed task shouldn't escape PM suspend
      PM: convert do_each_thread to for_each_process_thread

And diffstat says:
 include/linux/oom.h    |  3 +++
 kernel/freezer.c       |  9 +++------
 kernel/power/process.c | 47 ++++++++++++++++++++++++++++++++++++++---------
 mm/oom_kill.c          | 17 +++++++++++++++++
 mm/page_alloc.c        |  8 ++++++++
 5 files changed, 69 insertions(+), 15 deletions(-)

---
[1] http://marc.info/?l=linux-kernel&m=140986986423092
[2] http://marc.info/?l=linux-mm&m=141277728508500&w=2
[3] http://marc.info/?l=linux-kernel&m=141074263721166
[4] http://marc.info/?l=linux-kernel&m=140986986423092


* [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer
  2014-10-21  7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko
@ 2014-10-21  7:27 ` Michal Hocko
  2014-10-21 12:04   ` Rafael J. Wysocki
  2014-10-21  7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-10-21  7:27 UTC (permalink / raw)
  To: Andrew Morton, \"Rafael J. Wysocki\"
  Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML,
	linux-mm, Linux PM list

From: Cong Wang <xiyou.wangcong@gmail.com>

Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen
before deferring) the OOM killer relies on being able to thaw a frozen
task to handle an OOM situation, but a3201227f803 (freezer: make
freezing() test freeze conditions in effect instead of TIF_FREEZE)
reorganized the code and stopped clearing the freeze flag in
__thaw_task. This means that the target task only wakes up and goes into
the fridge again, because the freezing condition hasn't changed for it.
This reintroduces the bug fixed by f660daac474c6f.

Fix the issue by checking for the TIF_MEMDIE thread flag in
freezing_slow_path and excluding the task from freezing completely. If a
task was already frozen, it will be woken by __thaw_task from the OOM
killer and will get out of the freezer after rechecking freezing().
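
For context, the wait loop in __refrigerator() looks roughly like this
(a simplified sketch of kernel/freezer.c; the real code also handles
kthread_should_stop() for kernel threads):

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		spin_lock_irq(&freezer_lock);
		current->flags |= PF_FROZEN;
		if (!freezing(current))	/* false for TIF_MEMDIE tasks now */
			current->flags &= ~PF_FROZEN;
		spin_unlock_irq(&freezer_lock);

		if (!(current->flags & PF_FROZEN))
			break;
		schedule();
	}

A task woken by __thaw_task therefore goes straight back to schedule()
unless freezing() now returns false, which is exactly what the
TIF_MEMDIE check achieves.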

Changes since v1
- put the TIF_MEMDIE check into freezing_slow_path rather than into
  __refrigerator, as per Oleg
- restore the __thaw_task call in oom_scan_process_thread, because
  oom_kill_process will not wake a task in the fridge; it is sleeping
  uninterruptibly

[mhocko@suse.cz: rewrote the changelog]
Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
Cc: stable@vger.kernel.org # 3.3+
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/freezer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/freezer.c b/kernel/freezer.c
index aa6a8aadb911..8f9279b9c6d7 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,6 +42,9 @@ bool freezing_slow_path(struct task_struct *p)
 	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
 		return false;
 
+	if (test_thread_flag(TIF_MEMDIE))
+		return false;
+
 	if (pm_nosig_freezing || cgroup_freezing(p))
 		return true;
 
-- 
2.1.1



* [PATCH 2/4] freezer: remove obsolete comments in __thaw_task()
  2014-10-21  7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko
  2014-10-21  7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko
@ 2014-10-21  7:27 ` Michal Hocko
  2014-10-21 12:04   ` Rafael J. Wysocki
  2014-10-21  7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
  2014-10-21  7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko
  3 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-10-21  7:27 UTC (permalink / raw)
  To: Andrew Morton, \"Rafael J. Wysocki\"
  Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML,
	linux-mm, Linux PM list

From: Cong Wang <xiyou.wangcong@gmail.com>

__thaw_task() no longer clears the frozen flag since commit a3201227f803
(freezer: make freezing() test freeze conditions in effect instead of
TIF_FREEZE), so the comment describing that behavior is obsolete.

Cc: David Rientjes <rientjes@google.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 kernel/freezer.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/freezer.c b/kernel/freezer.c
index 8f9279b9c6d7..a8900a3bc27a 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -150,12 +150,6 @@ void __thaw_task(struct task_struct *p)
 {
 	unsigned long flags;
 
-	/*
-	 * Clear freezing and kick @p if FROZEN.  Clearing is guaranteed to
-	 * be visible to @p as waking up implies wmb.  Waking up inside
-	 * freezer_lock also prevents wakeups from leaking outside
-	 * refrigerator.
-	 */
 	spin_lock_irqsave(&freezer_lock, flags);
 	if (frozen(p))
 		wake_up_process(p);
-- 
2.1.1



* [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21  7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko
  2014-10-21  7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko
  2014-10-21  7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko
@ 2014-10-21  7:27 ` Michal Hocko
  2014-10-21 12:09   ` Rafael J. Wysocki
  2014-10-26 18:40   ` Pavel Machek
  2014-10-21  7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko
  3 siblings, 2 replies; 93+ messages in thread
From: Michal Hocko @ 2014-10-21  7:27 UTC (permalink / raw)
  To: Andrew Morton, \"Rafael J. Wysocki\"
  Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML,
	linux-mm, Linux PM list

The PM freezer relies on having all tasks frozen by the time devices are
getting frozen, so that no task will touch them while they are getting
frozen. But the OOM killer is allowed to kill an already frozen task in
order to handle an OOM situation. In order to protect from late wake-ups,
the OOM killer is disabled after all tasks are frozen. This, however,
still leaves a window open in which a killed task didn't manage to die
by the time freeze_processes finishes.

Reduce the race window by checking all tasks after the OOM killer has
been disabled. Unfortunately this is still not completely race free,
because oom_killer_disable cannot stop an already ongoing OOM killer, so
a task might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of the OOM killer and
the freezer is, however, too heavyweight for this highly unlikely case.

Introduce and check an oom_kills counter which gets incremented early,
when the allocator enters the __alloc_pages_may_oom path, and only check
all the tasks if the counter changes during the freezing attempt. The
counter is updated this early to reduce the race window, because the
allocator has already checked oom_killer_disabled, which is set by the
PM-freezing code. A false positive will push the PM freezer into a slow
path, but that is not a big deal.
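
The freezer-side check is the classic generation-counter pattern. Below
is a minimal, self-contained userspace sketch of the same idea (C11
atomics; the names mirror this patch, but nothing here is kernel API):

	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int oom_kills;

	static int oom_kills_count(void) { return atomic_load(&oom_kills); }
	static void note_oom_kill(void) { atomic_fetch_add(&oom_kills, 1); }

	int main(void)
	{
		/* snapshot the generation before the slow operation */
		int saved = oom_kills_count();

		/* a racing "OOM kill" happening in another path */
		note_oom_kill();

		/* any bump in between forces the slow re-check */
		if (oom_kills_count() != saved)
			puts("possible OOM during freeze, re-check all tasks");
		return 0;
	}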

Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # 3.2+
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    |  3 +++
 kernel/power/process.c | 31 ++++++++++++++++++++++++++++++-
 mm/oom_kill.c          | 17 +++++++++++++++++
 mm/page_alloc.c        |  8 ++++++++
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 647395a1a550..e8d6e1058723 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -50,6 +50,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
+
+extern int oom_kills_count(void);
+extern void note_oom_kill(void);
 extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			     unsigned int points, unsigned long totalpages,
 			     struct mem_cgroup *memcg, nodemask_t *nodemask,
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 4ee194eb524b..a397fa161d11 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -118,6 +118,7 @@ static int try_to_freeze_tasks(bool user_only)
 int freeze_processes(void)
 {
 	int error;
+	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -131,12 +132,40 @@ int freeze_processes(void)
 
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
+	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
-		printk("done.");
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
 		oom_killer_disable();
+
+		/*
+		 * There might have been an OOM kill while we were
+		 * freezing tasks and the killed task might be still
+		 * on the way out so we have to double check for race.
+		 */
+		if (oom_kills_count() != oom_kills_saved) {
+			struct task_struct *g, *p;
+
+			read_lock(&tasklist_lock);
+			for_each_process_thread(g, p) {
+				if (p == current || freezer_should_skip(p) ||
+				    frozen(p))
+					continue;
+				error = -EBUSY;
+				goto out_loop;
+			}
+out_loop:
+			read_unlock(&tasklist_lock);
+
+			if (error) {
+				__usermodehelper_set_disable_depth(UMH_ENABLED);
+				printk("OOM in progress.");
+				goto done;
+			}
+		}
+		printk("done.");
 	}
+done:
 	printk("\n");
 	BUG_ON(in_atomic());
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bbf405a3a18f..5340f6b91312 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,6 +404,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
+/*
+ * Number of OOM killer invocations (including memcg OOM killer).
+ * Primarily used by PM freezer to check for potential races with
+ * OOM killed frozen task.
+ */
+static atomic_t oom_kills = ATOMIC_INIT(0);
+
+int oom_kills_count(void)
+{
+	return atomic_read(&oom_kills);
+}
+
+void note_oom_kill(void)
+{
+	atomic_inc(&oom_kills);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb573b10af12..efccbbadd7c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2286,6 +2286,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
+	 * PM-freezer should be notified that there might be an OOM killer on its
+	 * way to kill and wake somebody up. This is too early and we might end
+	 * up not killing anything but false positives are acceptable.
+	 * See freeze_processes.
+	 */
+	note_oom_kill();
+
+	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
-- 
2.1.1



* [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread
  2014-10-21  7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko
                   ` (2 preceding siblings ...)
  2014-10-21  7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
@ 2014-10-21  7:27 ` Michal Hocko
  2014-10-21 12:10   ` Rafael J. Wysocki
  3 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-10-21  7:27 UTC (permalink / raw)
  To: Andrew Morton, \"Rafael J. Wysocki\"
  Cc: Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML,
	linux-mm, Linux PM list

As per 0c740d0afc3b (introduce for_each_thread() to replace the buggy
while_each_thread()), get rid of the do_each_thread { } while_each_thread()
construct and replace it with the less error-prone for_each_process_thread.

This patch doesn't introduce any user-visible change.
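
For reference, the old and the new constructs expand roughly as follows
(simplified from include/linux/sched.h of that era):

	#define do_each_thread(g, t) \
		for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do

	#define while_each_thread(g, t) \
		while ((t = next_thread(t)) != g)

	#define for_each_process_thread(p, t)	\
		for_each_process(p) for_each_thread(p, t)

while_each_thread() is racy with respect to exec/exit and can miss its
termination condition, which is what 0c740d0afc3b addresses;
for_each_process_thread() is an ordinary nested loop over the RCU-safe
thread list.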

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 kernel/power/process.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index a397fa161d11..7fd7b72554fe 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -46,13 +46,13 @@ static int try_to_freeze_tasks(bool user_only)
 	while (true) {
 		todo = 0;
 		read_lock(&tasklist_lock);
-		do_each_thread(g, p) {
+		for_each_process_thread(g, p) {
 			if (p == current || !freeze_task(p))
 				continue;
 
 			if (!freezer_should_skip(p))
 				todo++;
-		} while_each_thread(g, p);
+		}
 		read_unlock(&tasklist_lock);
 
 		if (!user_only) {
@@ -93,11 +93,11 @@ static int try_to_freeze_tasks(bool user_only)
 
 		if (!wakeup) {
 			read_lock(&tasklist_lock);
-			do_each_thread(g, p) {
+			for_each_process_thread(g, p) {
 				if (p != current && !freezer_should_skip(p)
 				    && freezing(p) && !frozen(p))
 					sched_show_task(p);
-			} while_each_thread(g, p);
+			}
 			read_unlock(&tasklist_lock);
 		}
 	} else {
@@ -219,11 +219,11 @@ void thaw_processes(void)
 	thaw_workqueues();
 
 	read_lock(&tasklist_lock);
-	do_each_thread(g, p) {
+	for_each_process_thread(g, p) {
 		/* No other threads should have PF_SUSPEND_TASK set */
 		WARN_ON((p != curr) && (p->flags & PF_SUSPEND_TASK));
 		__thaw_task(p);
-	} while_each_thread(g, p);
+	}
 	read_unlock(&tasklist_lock);
 
 	WARN_ON(!(curr->flags & PF_SUSPEND_TASK));
@@ -246,10 +246,10 @@ void thaw_kernel_threads(void)
 	thaw_workqueues();
 
 	read_lock(&tasklist_lock);
-	do_each_thread(g, p) {
+	for_each_process_thread(g, p) {
 		if (p->flags & (PF_KTHREAD | PF_WQ_WORKER))
 			__thaw_task(p);
-	} while_each_thread(g, p);
+	}
 	read_unlock(&tasklist_lock);
 
 	schedule();
-- 
2.1.1



* Re: [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer
  2014-10-21  7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko
@ 2014-10-21 12:04   ` Rafael J. Wysocki
  0 siblings, 0 replies; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-21 12:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 09:27:12 AM Michal Hocko wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> 
> Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen
> before deferring) OOM killer relies on being able to thaw a frozen task
> to handle OOM situation but a3201227f803 (freezer: make freezing() test
> freeze conditions in effect instead of TIF_FREEZE) has reorganized the
> code and stopped clearing freeze flag in __thaw_task. This means that
> the target task only wakes up and goes into the fridge again because the
> freezing condition hasn't changed for it. This reintroduces the bug
> fixed by f660daac474c6f.
> 
> Fix the issue by checking for TIF_MEMDIE thread flag in
> freezing_slow_path and exclude the task from freezing completely. If a
> task was already frozen it would get woken by __thaw_task from OOM killer
> and get out of freezer after rechecking freezing().
> 
> Changes since v1
> - put TIF_MEMDIE check into freezing_slowpath rather than in __refrigerator
>   as per Oleg
> - return __thaw_task into oom_scan_process_thread because
>   oom_kill_process will not wake task in the fridge because it is
>   sleeping uninterruptible
> 
> [mhocko@suse.cz: rewrote the changelog]
> Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
> Cc: stable@vger.kernel.org # 3.3+
> Cc: David Rientjes <rientjes@google.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: Oleg Nesterov <oleg@redhat.com>

ACK

> ---
>  kernel/freezer.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/freezer.c b/kernel/freezer.c
> index aa6a8aadb911..8f9279b9c6d7 100644
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -42,6 +42,9 @@ bool freezing_slow_path(struct task_struct *p)
>  	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
>  		return false;
>  
> +	if (test_thread_flag(TIF_MEMDIE))
> +		return false;
> +
>  	if (pm_nosig_freezing || cgroup_freezing(p))
>  		return true;
>  
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 2/4] freezer: remove obsolete comments in __thaw_task()
  2014-10-21  7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko
@ 2014-10-21 12:04   ` Rafael J. Wysocki
  0 siblings, 0 replies; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-21 12:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 09:27:13 AM Michal Hocko wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> 
> __thaw_task() no longer clears frozen flag since commit a3201227f803
> (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE).
> 
> Cc: David Rientjes <rientjes@google.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

ACK

> ---
>  kernel/freezer.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/kernel/freezer.c b/kernel/freezer.c
> index 8f9279b9c6d7..a8900a3bc27a 100644
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -150,12 +150,6 @@ void __thaw_task(struct task_struct *p)
>  {
>  	unsigned long flags;
>  
> -	/*
> -	 * Clear freezing and kick @p if FROZEN.  Clearing is guaranteed to
> -	 * be visible to @p as waking up implies wmb.  Waking up inside
> -	 * freezer_lock also prevents wakeups from leaking outside
> -	 * refrigerator.
> -	 */
>  	spin_lock_irqsave(&freezer_lock, flags);
>  	if (frozen(p))
>  		wake_up_process(p);
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21  7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
@ 2014-10-21 12:09   ` Rafael J. Wysocki
  2014-10-21 13:14     ` Michal Hocko
  2014-10-26 18:40   ` Pavel Machek
  1 sibling, 1 reply; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-21 12:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 09:27:14 AM Michal Hocko wrote:
> PM freezer relies on having all tasks frozen by the time devices are
> getting frozen so that no task will touch them while they are getting
> frozen. But OOM killer is allowed to kill an already frozen task in
> order to handle OOM situation. In order to protect from late wake ups
> OOM killer is disabled after all tasks are frozen. This, however, still
> keeps a window open when a killed task didn't manage to die by the time
> freeze_processes finishes.
> 
> Reduce the race window by checking all tasks after OOM killer has been
> disabled. This is still not race free completely unfortunately because
> oom_killer_disable cannot stop an already ongoing OOM killer so a task
> might still wake up from the fridge and get killed without
> freeze_processes noticing. Full synchronization of OOM and freezer is,
> however, too heavy weight for this highly unlikely case.
> 
> Introduce and check oom_kills counter which gets incremented early when
> the allocator enters __alloc_pages_may_oom path and only check all the
> tasks if the counter changes during the freezing attempt. The counter
> is updated so early to reduce the race window since allocator checked
> oom_killer_disabled which is set by PM-freezing code. A false positive
> will push the PM-freezer into a slow path but that is not a big deal.
> 
> Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: stable@vger.kernel.org # 3.2+
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/oom.h    |  3 +++
>  kernel/power/process.c | 31 ++++++++++++++++++++++++++++++-
>  mm/oom_kill.c          | 17 +++++++++++++++++
>  mm/page_alloc.c        |  8 ++++++++
>  4 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 647395a1a550..e8d6e1058723 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -50,6 +50,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
>  extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
>  		unsigned long totalpages);
> +
> +extern int oom_kills_count(void);
> +extern void note_oom_kill(void);
>  extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			     unsigned int points, unsigned long totalpages,
>  			     struct mem_cgroup *memcg, nodemask_t *nodemask,
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 4ee194eb524b..a397fa161d11 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -118,6 +118,7 @@ static int try_to_freeze_tasks(bool user_only)
>  int freeze_processes(void)
>  {
>  	int error;
> +	int oom_kills_saved;
>  
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -131,12 +132,40 @@ int freeze_processes(void)
>  
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> +	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
> -		printk("done.");
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
>  		oom_killer_disable();
> +
> +		/*
> +		 * There might have been an OOM kill while we were
> +		 * freezing tasks and the killed task might be still
> +		 * on the way out so we have to double check for race.
> +		 */
> +		if (oom_kills_count() != oom_kills_saved) {
> +			struct task_struct *g, *p;
> +
> +			read_lock(&tasklist_lock);
> +			for_each_process_thread(g, p) {
> +				if (p == current || freezer_should_skip(p) ||
> +				    frozen(p))
> +					continue;
> +				error = -EBUSY;
> +				goto out_loop;
> +			}
> +out_loop:

Well, it looks like this will work here too:

			for_each_process_thread(g, p)
				if (p != current && !frozen(p) &&
				    !freezer_should_skip(p)) {
					error = -EBUSY;
					break;
				}

or I am helplessly misreading the code.

> +			read_unlock(&tasklist_lock);
> +
> +			if (error) {
> +				__usermodehelper_set_disable_depth(UMH_ENABLED);
> +				printk("OOM in progress.");
> +				goto done;
> +			}
> +		}
> +		printk("done.");
>  	}
> +done:
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index bbf405a3a18f..5340f6b91312 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -404,6 +404,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  		dump_tasks(memcg, nodemask);
>  }
>  
> +/*
> + * Number of OOM killer invocations (including memcg OOM killer).
> + * Primarily used by PM freezer to check for potential races with
> + * OOM killed frozen task.
> + */
> +static atomic_t oom_kills = ATOMIC_INIT(0);
> +
> +int oom_kills_count(void)
> +{
> +	return atomic_read(&oom_kills);
> +}
> +
> +void note_oom_kill(void)
> +{
> +	atomic_inc(&oom_kills);
> +}
> +
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  /*
>   * Must be called while holding a reference to p, which will be released upon
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cb573b10af12..efccbbadd7c9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2286,6 +2286,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/*
> +	 * PM-freezer should be notified that there might be an OOM killer on its
> +	 * way to kill and wake somebody up. This is too early and we might end
> +	 * up not killing anything but false positives are acceptable.
> +	 * See freeze_processes.
> +	 */
> +	note_oom_kill();
> +
> +	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
>  	 * we're still under heavy pressure.
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread
  2014-10-21  7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko
@ 2014-10-21 12:10   ` Rafael J. Wysocki
  2014-10-21 13:19     ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-21 12:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 09:27:15 AM Michal Hocko wrote:
> As per 0c740d0afc3b (introduce for_each_thread() to replace the buggy
> while_each_thread()), get rid of the do_each_thread { } while_each_thread()
> construct and replace it with the less error-prone for_each_process_thread.
> 
> This patch doesn't introduce any user visible change.
> 
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

ACK

Or do you want me to handle this series?

> ---
>  kernel/power/process.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index a397fa161d11..7fd7b72554fe 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -46,13 +46,13 @@ static int try_to_freeze_tasks(bool user_only)
>  	while (true) {
>  		todo = 0;
>  		read_lock(&tasklist_lock);
> -		do_each_thread(g, p) {
> +		for_each_process_thread(g, p) {
>  			if (p == current || !freeze_task(p))
>  				continue;
>  
>  			if (!freezer_should_skip(p))
>  				todo++;
> -		} while_each_thread(g, p);
> +		}
>  		read_unlock(&tasklist_lock);
>  
>  		if (!user_only) {
> @@ -93,11 +93,11 @@ static int try_to_freeze_tasks(bool user_only)
>  
>  		if (!wakeup) {
>  			read_lock(&tasklist_lock);
> -			do_each_thread(g, p) {
> +			for_each_process_thread(g, p) {
>  				if (p != current && !freezer_should_skip(p)
>  				    && freezing(p) && !frozen(p))
>  					sched_show_task(p);
> -			} while_each_thread(g, p);
> +			}
>  			read_unlock(&tasklist_lock);
>  		}
>  	} else {
> @@ -219,11 +219,11 @@ void thaw_processes(void)
>  	thaw_workqueues();
>  
>  	read_lock(&tasklist_lock);
> -	do_each_thread(g, p) {
> +	for_each_process_thread(g, p) {
>  		/* No other threads should have PF_SUSPEND_TASK set */
>  		WARN_ON((p != curr) && (p->flags & PF_SUSPEND_TASK));
>  		__thaw_task(p);
> -	} while_each_thread(g, p);
> +	}
>  	read_unlock(&tasklist_lock);
>  
>  	WARN_ON(!(curr->flags & PF_SUSPEND_TASK));
> @@ -246,10 +246,10 @@ void thaw_kernel_threads(void)
>  	thaw_workqueues();
>  
>  	read_lock(&tasklist_lock);
> -	do_each_thread(g, p) {
> +	for_each_process_thread(g, p) {
>  		if (p->flags & (PF_KTHREAD | PF_WQ_WORKER))
>  			__thaw_task(p);
> -	} while_each_thread(g, p);
> +	}
>  	read_unlock(&tasklist_lock);
>  
>  	schedule();
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 12:09   ` Rafael J. Wysocki
@ 2014-10-21 13:14     ` Michal Hocko
  2014-10-21 13:42       ` Rafael J. Wysocki
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-10-21 13:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote:
[...]
> > @@ -131,12 +132,40 @@ int freeze_processes(void)
> >  
> >  	printk("Freezing user space processes ... ");
> >  	pm_freezing = true;
> > +	oom_kills_saved = oom_kills_count();
> >  	error = try_to_freeze_tasks(true);
> >  	if (!error) {
> > -		printk("done.");
> >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> >  		oom_killer_disable();
> > +
> > +		/*
> > +		 * There might have been an OOM kill while we were
> > +		 * freezing tasks and the killed task might be still
> > +		 * on the way out so we have to double check for race.
> > +		 */
> > +		if (oom_kills_count() != oom_kills_saved) {
> > +			struct task_struct *g, *p;
> > +
> > +			read_lock(&tasklist_lock);
> > +			for_each_process_thread(g, p) {
> > +				if (p == current || freezer_should_skip(p) ||
> > +				    frozen(p))
> > +					continue;
> > +				error = -EBUSY;
> > +				goto out_loop;
> > +			}
> > +out_loop:
> 
> Well, it looks like this will work here too:
> 
> 			for_each_process_thread(g, p)
> 				if (p != current && !frozen(p) &&
> 				    !freezer_should_skip(p)) {
> 					error = -EBUSY;
> 					break;
> 				}
> 
> or I am helplessly misreading the code.

break will not work because for_each_process_thread is a double loop.
Apart from that, the negated condition is OK as well. I can change it
if you prefer.
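
For reference, it expands to nested loops, roughly:

	#define for_each_process_thread(p, t)	\
		for_each_process(p) for_each_thread(p, t)

so a bare break would only terminate the inner for_each_thread()
iteration while the outer for_each_process() walk continued.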

> > +			read_unlock(&tasklist_lock);
> > +
> > +			if (error) {
> > +				__usermodehelper_set_disable_depth(UMH_ENABLED);
> > +				printk("OOM in progress.");
> > +				goto done;
> > +			}
> > +		}
> > +		printk("done.");
> >  	}
> > +done:
> >  	printk("\n");
> >  	BUG_ON(in_atomic());
> >  
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread
  2014-10-21 12:10   ` Rafael J. Wysocki
@ 2014-10-21 13:19     ` Michal Hocko
  2014-10-21 13:43       ` Rafael J. Wysocki
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-10-21 13:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tue 21-10-14 14:10:18, Rafael J. Wysocki wrote:
> On Tuesday, October 21, 2014 09:27:15 AM Michal Hocko wrote:
> > As per 0c740d0afc3b (introduce for_each_thread() to replace the buggy
> > while_each_thread()), get rid of the do_each_thread { } while_each_thread()
> > construct and replace it with the less error-prone for_each_process_thread.
> > 
> > This patch doesn't introduce any user visible change.
> > 
> > Suggested-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> ACK
> 
> Or do you want me to handle this series?

I don't know; I hoped either you or Andrew would pick it up.

> > ---
> >  kernel/power/process.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/kernel/power/process.c b/kernel/power/process.c
> > index a397fa161d11..7fd7b72554fe 100644
> > --- a/kernel/power/process.c
> > +++ b/kernel/power/process.c
> > @@ -46,13 +46,13 @@ static int try_to_freeze_tasks(bool user_only)
> >  	while (true) {
> >  		todo = 0;
> >  		read_lock(&tasklist_lock);
> > -		do_each_thread(g, p) {
> > +		for_each_process_thread(g, p) {
> >  			if (p == current || !freeze_task(p))
> >  				continue;
> >  
> >  			if (!freezer_should_skip(p))
> >  				todo++;
> > -		} while_each_thread(g, p);
> > +		}
> >  		read_unlock(&tasklist_lock);
> >  
> >  		if (!user_only) {
> > @@ -93,11 +93,11 @@ static int try_to_freeze_tasks(bool user_only)
> >  
> >  		if (!wakeup) {
> >  			read_lock(&tasklist_lock);
> > -			do_each_thread(g, p) {
> > +			for_each_process_thread(g, p) {
> >  				if (p != current && !freezer_should_skip(p)
> >  				    && freezing(p) && !frozen(p))
> >  					sched_show_task(p);
> > -			} while_each_thread(g, p);
> > +			}
> >  			read_unlock(&tasklist_lock);
> >  		}
> >  	} else {
> > @@ -219,11 +219,11 @@ void thaw_processes(void)
> >  	thaw_workqueues();
> >  
> >  	read_lock(&tasklist_lock);
> > -	do_each_thread(g, p) {
> > +	for_each_process_thread(g, p) {
> >  		/* No other threads should have PF_SUSPEND_TASK set */
> >  		WARN_ON((p != curr) && (p->flags & PF_SUSPEND_TASK));
> >  		__thaw_task(p);
> > -	} while_each_thread(g, p);
> > +	}
> >  	read_unlock(&tasklist_lock);
> >  
> >  	WARN_ON(!(curr->flags & PF_SUSPEND_TASK));
> > @@ -246,10 +246,10 @@ void thaw_kernel_threads(void)
> >  	thaw_workqueues();
> >  
> >  	read_lock(&tasklist_lock);
> > -	do_each_thread(g, p) {
> > +	for_each_process_thread(g, p) {
> >  		if (p->flags & (PF_KTHREAD | PF_WQ_WORKER))
> >  			__thaw_task(p);
> > -	} while_each_thread(g, p);
> > +	}
> >  	read_unlock(&tasklist_lock);
> >  
> >  	schedule();
> > 
> 
> -- 
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 13:14     ` Michal Hocko
@ 2014-10-21 13:42       ` Rafael J. Wysocki
  2014-10-21 14:11         ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-21 13:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 03:14:45 PM Michal Hocko wrote:
> On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote:
> [...]
> > > @@ -131,12 +132,40 @@ int freeze_processes(void)
> > >  
> > >  	printk("Freezing user space processes ... ");
> > >  	pm_freezing = true;
> > > +	oom_kills_saved = oom_kills_count();
> > >  	error = try_to_freeze_tasks(true);
> > >  	if (!error) {
> > > -		printk("done.");
> > >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> > >  		oom_killer_disable();
> > > +
> > > +		/*
> > > +		 * There might have been an OOM kill while we were
> > > +		 * freezing tasks and the killed task might be still
> > > +		 * on the way out so we have to double check for race.
> > > +		 */
> > > +		if (oom_kills_count() != oom_kills_saved) {
> > > +			struct task_struct *g, *p;
> > > +
> > > +			read_lock(&tasklist_lock);
> > > +			for_each_process_thread(g, p) {
> > > +				if (p == current || freezer_should_skip(p) ||
> > > +				    frozen(p))
> > > +					continue;
> > > +				error = -EBUSY;
> > > +				goto out_loop;
> > > +			}
> > > +out_loop:
> > 
> > Well, it looks like this will work here too:
> > 
> > 			for_each_process_thread(g, p)
> > 				if (p != current && !frozen(p) &&
> > 				    !freezer_should_skip(p)) {
> > 					error = -EBUSY;
> > 					break;
> > 				}
> > 
> > or I am helplessly misreading the code.
> 
> break will not work because for_each_process_thread is a double loop.

I see.  In that case I'd do:

                        for_each_process_thread(g, p)
                                if (p != current && !frozen(p) &&
                                    !freezer_should_skip(p)) {

					read_unlock(&tasklist_lock);

					__usermodehelper_set_disable_depth(UMH_ENABLED);
					printk("OOM in progress.");
                                        error = -EBUSY;
                                        goto done;
                                }

to avoid adding the new label that looks odd.

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread
  2014-10-21 13:19     ` Michal Hocko
@ 2014-10-21 13:43       ` Rafael J. Wysocki
  0 siblings, 0 replies; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-21 13:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 03:19:53 PM Michal Hocko wrote:
> On Tue 21-10-14 14:10:18, Rafael J. Wysocki wrote:
> > On Tuesday, October 21, 2014 09:27:15 AM Michal Hocko wrote:
> > > As per 0c740d0afc3b (introduce for_each_thread() to replace the buggy
> > > while_each_thread()), get rid of the do_each_thread { } while_each_thread()
> > > construct and replace it with the less error-prone for_each_process_thread.
> > > 
> > > This patch doesn't introduce any user visible change.
> > > 
> > > Suggested-by: Oleg Nesterov <oleg@redhat.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > 
> > ACK
> > 
> > Or do you want me to handle this series?
> 
> I don't know; I hoped either you or Andrew would pick it up.

OK, I will then.

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 13:42       ` Rafael J. Wysocki
@ 2014-10-21 14:11         ` Michal Hocko
  2014-10-21 14:41           ` Rafael J. Wysocki
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-10-21 14:11 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tue 21-10-14 15:42:23, Rafael J. Wysocki wrote:
> On Tuesday, October 21, 2014 03:14:45 PM Michal Hocko wrote:
> > On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote:
> > [...]
> > > > @@ -131,12 +132,40 @@ int freeze_processes(void)
> > > >  
> > > >  	printk("Freezing user space processes ... ");
> > > >  	pm_freezing = true;
> > > > +	oom_kills_saved = oom_kills_count();
> > > >  	error = try_to_freeze_tasks(true);
> > > >  	if (!error) {
> > > > -		printk("done.");
> > > >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> > > >  		oom_killer_disable();
> > > > +
> > > > +		/*
> > > > +		 * There might have been an OOM kill while we were
> > > > +		 * freezing tasks and the killed task might be still
> > > > +		 * on the way out so we have to double check for race.
> > > > +		 */
> > > > +		if (oom_kills_count() != oom_kills_saved) {
> > > > +			struct task_struct *g, *p;
> > > > +
> > > > +			read_lock(&tasklist_lock);
> > > > +			for_each_process_thread(g, p) {
> > > > +				if (p == current || freezer_should_skip(p) ||
> > > > +				    frozen(p))
> > > > +					continue;
> > > > +				error = -EBUSY;
> > > > +				goto out_loop;
> > > > +			}
> > > > +out_loop:
> > > 
> > > Well, it looks like this will work here too:
> > > 
> > > 			for_each_process_thread(g, p)
> > > 				if (p != current && !frozen(p) &&
> > > 				    !freezer_should_skip(p)) {
> > > 					error = -EBUSY;
> > > 					break;
> > > 				}
> > > 
> > > or I am helplessly misreading the code.
> > 
> > break will not work because for_each_process_thread is a double loop.
> 
> I see.  In that case I'd do:
> 
>                         for_each_process_thread(g, p)
>                                 if (p != current && !frozen(p) &&
>                                     !freezer_should_skip(p)) {
> 
> 					read_unlock(&tasklist_lock);
> 
> 					__usermodehelper_set_disable_depth(UMH_ENABLED);
> 					printk("OOM in progress.");
>                                         error = -EBUSY;
>                                         goto done;
>                                 }
> 
> to avoid adding the new label that looks odd.

OK, incremental diff on top. I will post the complete patch if you are
happier with this change.
---
diff --git a/kernel/power/process.c b/kernel/power/process.c
index a397fa161d11..7a37cf3eb1a2 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,6 +108,28 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
+/*
+ * Returns true if all freezable tasks (except for current) are frozen already
+ */
+static bool check_frozen_processes(void)
+{
+	struct task_struct *g, *p;
+	bool ret = true;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, p) {
+		if (p != current && !freezer_should_skip(p) &&
+		    !frozen(p)) {
+			ret = false;
+			goto done;
+		}
+	}
+done:
+	read_unlock(&tasklist_lock);
+
+	return ret;
+}
+
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -143,25 +165,12 @@ int freeze_processes(void)
 		 * freezing tasks and the killed task might be still
 		 * on the way out so we have to double check for race.
 		 */
-		if (oom_kills_count() != oom_kills_saved) {
-			struct task_struct *g, *p;
-
-			read_lock(&tasklist_lock);
-			for_each_process_thread(g, p) {
-				if (p == current || freezer_should_skip(p) ||
-				    frozen(p))
-					continue;
-				error = -EBUSY;
-				goto out_loop;
-			}
-out_loop:
-			read_unlock(&tasklist_lock);
-
-			if (error) {
-				__usermodehelper_set_disable_depth(UMH_ENABLED);
-				printk("OOM in progress.");
-				goto done;
-			}
+		if (oom_kills_count() != oom_kills_saved &&
+				!check_frozen_processes()) {
+			__usermodehelper_set_disable_depth(UMH_ENABLED);
+			printk("OOM in progress.");
+			error = -EBUSY;
+			goto done;
 		}
 		printk("done.");
 	}
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 14:41           ` Rafael J. Wysocki
@ 2014-10-21 14:29             ` Michal Hocko
  2014-10-22 14:39               ` Rafael J. Wysocki
                                 ` (2 more replies)
  0 siblings, 3 replies; 93+ messages in thread
From: Michal Hocko @ 2014-10-21 14:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote:
> On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote:
[...]
> > OK, incremental diff on top. I will post the complete patch if you are
> > happier with this change
> 
> Yes, I am.
---
From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 20 Oct 2014 18:12:32 +0200
Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend

The PM freezer relies on having all tasks frozen by the time devices are
getting frozen, so that no task will touch them while they are getting
frozen. But the OOM killer is allowed to kill an already frozen task in
order to handle an OOM situation. In order to protect from late wake-ups,
the OOM killer is disabled after all tasks are frozen. This, however,
still leaves a window open in which a killed task didn't manage to die
by the time freeze_processes finishes.

Reduce the race window by checking all tasks after the OOM killer has
been disabled. Unfortunately this is still not completely race free,
because oom_killer_disable cannot stop an already ongoing OOM killer, so
a task might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of the OOM killer and
the freezer is, however, too heavyweight for this highly unlikely case.

Introduce and check an oom_kills counter which gets incremented early,
when the allocator enters the __alloc_pages_may_oom path, and only check
all the tasks if the counter changes during the freezing attempt. The
counter is updated this early to reduce the race window, because the
allocator has already checked oom_killer_disabled, which is set by the
PM-freezing code. A false positive will push the PM freezer into a slow
path, but that is not a big deal.

Changes since v1
- push the re-check loop out of freeze_processes into
  check_frozen_processes and invert the condition to make the code more
  readable as per Rafael

Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # 3.2+
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    |  3 +++
 kernel/power/process.c | 40 +++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c          | 17 +++++++++++++++++
 mm/page_alloc.c        |  8 ++++++++
 4 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 647395a1a550..e8d6e1058723 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -50,6 +50,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
+
+extern int oom_kills_count(void);
+extern void note_oom_kill(void);
 extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			     unsigned int points, unsigned long totalpages,
 			     struct mem_cgroup *memcg, nodemask_t *nodemask,
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 4ee194eb524b..7a37cf3eb1a2 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,6 +108,28 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
+/*
+ * Returns true if all freezable tasks (except for current) are frozen already
+ */
+static bool check_frozen_processes(void)
+{
+	struct task_struct *g, *p;
+	bool ret = true;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, p) {
+		if (p != current && !freezer_should_skip(p) &&
+		    !frozen(p)) {
+			ret = false;
+			goto done;
+		}
+	}
+done:
+	read_unlock(&tasklist_lock);
+
+	return ret;
+}
+
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -118,6 +140,7 @@ static int try_to_freeze_tasks(bool user_only)
 int freeze_processes(void)
 {
 	int error;
+	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -131,12 +154,27 @@ int freeze_processes(void)
 
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
+	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
-		printk("done.");
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
 		oom_killer_disable();
+
+		/*
+		 * There might have been an OOM kill while we were
+		 * freezing tasks and the killed task might be still
+		 * on the way out so we have to double check for race.
+		 */
+		if (oom_kills_count() != oom_kills_saved &&
+				!check_frozen_processes()) {
+			__usermodehelper_set_disable_depth(UMH_ENABLED);
+			printk("OOM in progress.");
+			error = -EBUSY;
+			goto done;
+		}
+		printk("done.");
 	}
+done:
 	printk("\n");
 	BUG_ON(in_atomic());
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bbf405a3a18f..5340f6b91312 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,6 +404,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
+/*
+ * Number of OOM killer invocations (including memcg OOM killer).
+ * Primarily used by PM freezer to check for potential races with
+ * OOM killed frozen task.
+ */
+static atomic_t oom_kills = ATOMIC_INIT(0);
+
+int oom_kills_count(void)
+{
+	return atomic_read(&oom_kills);
+}
+
+void note_oom_kill(void)
+{
+	atomic_inc(&oom_kills);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb573b10af12..22f1929469ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2286,6 +2286,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
+	 * PM-freezer should be notified that there might be an OOM killer on
+	 * its way to kill and wake somebody up. This is too early and we might
+	 * end up not killing anything but false positives are acceptable.
+	 * See freeze_processes.
+	 */
+	note_oom_kill();
+
+	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
-- 
2.1.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 14:11         ` Michal Hocko
@ 2014-10-21 14:41           ` Rafael J. Wysocki
  2014-10-21 14:29             ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-21 14:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote:
> On Tue 21-10-14 15:42:23, Rafael J. Wysocki wrote:
> > On Tuesday, October 21, 2014 03:14:45 PM Michal Hocko wrote:
> > > On Tue 21-10-14 14:09:27, Rafael J. Wysocki wrote:
> > > [...]
> > > > > @@ -131,12 +132,40 @@ int freeze_processes(void)
> > > > >  
> > > > >  	printk("Freezing user space processes ... ");
> > > > >  	pm_freezing = true;
> > > > > +	oom_kills_saved = oom_kills_count();
> > > > >  	error = try_to_freeze_tasks(true);
> > > > >  	if (!error) {
> > > > > -		printk("done.");
> > > > >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> > > > >  		oom_killer_disable();
> > > > > +
> > > > > +		/*
> > > > > +		 * There might have been an OOM kill while we were
> > > > > +		 * freezing tasks and the killed task might be still
> > > > > +		 * on the way out so we have to double check for race.
> > > > > +		 */
> > > > > +		if (oom_kills_count() != oom_kills_saved) {
> > > > > +			struct task_struct *g, *p;
> > > > > +
> > > > > +			read_lock(&tasklist_lock);
> > > > > +			for_each_process_thread(g, p) {
> > > > > +				if (p == current || freezer_should_skip(p) ||
> > > > > +				    frozen(p))
> > > > > +					continue;
> > > > > +				error = -EBUSY;
> > > > > +				goto out_loop;
> > > > > +			}
> > > > > +out_loop:
> > > > 
> > > > Well, it looks like this will work here too:
> > > > 
> > > > 			for_each_process_thread(g, p)
> > > > 				if (p != current && !frozen(p) &&
> > > > 				    !freezer_should_skip(p)) {
> > > > 					error = -EBUSY;
> > > > 					break;
> > > > 				}
> > > > 
> > > > or I am helplessly misreading the code.
> > > 
> > > break will not work because for_each_process_thread is a double loop.
> > 
> > I see.  In that case I'd do:
> > 
> >                         for_each_process_thread(g, p)
> >                                 if (p != current && !frozen(p) &&
> >                                     !freezer_should_skip(p)) {
> > 
> > 					read_unlock(&tasklist_lock);
> > 
> > 					__usermodehelper_set_disable_depth(UMH_ENABLED);
> > 					printk("OOM in progress.");
> >                                         error = -EBUSY;
> >                                         goto done;
> >                                 }
> > 
> > to avoid adding the new label that looks odd.
> 
> OK, incremental diff on top. I will post the complete patch if you are
> happier with this change

Yes, I am.

> ---
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index a397fa161d11..7a37cf3eb1a2 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -108,6 +108,28 @@ static int try_to_freeze_tasks(bool user_only)
>  	return todo ? -EBUSY : 0;
>  }
>  
> +/*
> + * Returns true if all freezable tasks (except for current) are frozen already
> + */
> +static bool check_frozen_processes(void)
> +{
> +	struct task_struct *g, *p;
> +	bool ret = true;
> +
> +	read_lock(&tasklist_lock);
> +	for_each_process_thread(g, p) {
> +		if (p != current && !freezer_should_skip(p) &&
> +		    !frozen(p)) {
> +			ret = false;
> +			goto done;
> +		}
> +	}
> +done:
> +	read_unlock(&tasklist_lock);
> +
> +	return ret;
> +}
> +
>  /**
>   * freeze_processes - Signal user space processes to enter the refrigerator.
>   * The current thread will not be frozen.  The same process that calls
> @@ -143,25 +165,12 @@ int freeze_processes(void)
>  		 * freezing tasks and the killed task might be still
>  		 * on the way out so we have to double check for race.
>  		 */
> -		if (oom_kills_count() != oom_kills_saved) {
> -			struct task_struct *g, *p;
> -
> -			read_lock(&tasklist_lock);
> -			for_each_process_thread(g, p) {
> -				if (p == current || freezer_should_skip(p) ||
> -				    frozen(p))
> -					continue;
> -				error = -EBUSY;
> -				goto out_loop;
> -			}
> -out_loop:
> -			read_unlock(&tasklist_lock);
> -
> -			if (error) {
> -				__usermodehelper_set_disable_depth(UMH_ENABLED);
> -				printk("OOM in progress.");
> -				goto done;
> -			}
> +		if (oom_kills_count() != oom_kills_saved &&
> +				!check_frozen_processes()) {
> +			__usermodehelper_set_disable_depth(UMH_ENABLED);
> +			printk("OOM in progress.");
> +			error = -EBUSY;
> +			goto done;
>  		}
>  		printk("done.");
>  	}
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-22 14:39               ` Rafael J. Wysocki
@ 2014-10-22 14:22                 ` Michal Hocko
  2014-10-22 21:18                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-10-22 14:22 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 22-10-14 16:39:12, Rafael J. Wysocki wrote:
> On Tuesday, October 21, 2014 04:29:39 PM Michal Hocko wrote:
> > On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote:
> > > On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote:
> > [...]
> > > > OK, incremental diff on top. I will post the complete patch if you are
> > > > happier with this change
> > > 
> > > Yes, I am.
> > ---
> > From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Mon, 20 Oct 2014 18:12:32 +0200
> > Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend
> > 
> > PM freezer relies on having all tasks frozen by the time devices are
> > getting frozen so that no task will touch them while they are getting
> > frozen. But OOM killer is allowed to kill an already frozen task in
> > order to handle an OOM situation. In order to protect from late wake ups
> > OOM killer is disabled after all tasks are frozen. This, however, still
> > keeps a window open when a killed task didn't manage to die by the time
> > freeze_processes finishes.
> > 
> > Reduce the race window by checking all tasks after OOM killer has been
> > disabled. This is still not completely race free, unfortunately, because
> > oom_killer_disable cannot stop an already ongoing OOM killer so a task
> > might still wake up from the fridge and get killed without
> > freeze_processes noticing. Full synchronization of OOM and freezer is,
> > however, too heavyweight for this highly unlikely case.
> > 
> > Introduce and check oom_kills counter which gets incremented early when
> > the allocator enters __alloc_pages_may_oom path and only check all the
> > tasks if the counter changes during the freezing attempt. The counter
> > is updated so early to reduce the race window since allocator checked
> > oom_killer_disabled which is set by PM-freezing code. A false positive
> > will push the PM-freezer into a slow path but that is not a big deal.
> > 
> > Changes since v1
> > - push the re-check loop out of freeze_processes into
> >   check_frozen_processes and invert the condition to make the code more
> >   readable as per Rafael
> 
> I've applied that along with the rest of the series, but what about the
> following cleanup patch on top of it?

Sure, looks good to me.

> 
> Rafael
> 
> 
> ---
>  kernel/power/process.c |   31 ++++++++++++++++---------------
>  1 file changed, 16 insertions(+), 15 deletions(-)
> 
> Index: linux-pm/kernel/power/process.c
> ===================================================================
> --- linux-pm.orig/kernel/power/process.c
> +++ linux-pm/kernel/power/process.c
> @@ -108,25 +108,27 @@ static int try_to_freeze_tasks(bool user
>  	return todo ? -EBUSY : 0;
>  }
>  
> +static bool __check_frozen_processes(void)
> +{
> +	struct task_struct *g, *p;
> +
> +	for_each_process_thread(g, p)
> +		if (p != current && !freezer_should_skip(p) && !frozen(p))
> +			return false;
> +
> +	return true;
> +}
> +
>  /*
>   * Returns true if all freezable tasks (except for current) are frozen already
>   */
>  static bool check_frozen_processes(void)
>  {
> -	struct task_struct *g, *p;
> -	bool ret = true;
> +	bool ret;
>  
>  	read_lock(&tasklist_lock);
> -	for_each_process_thread(g, p) {
> -		if (p != current && !freezer_should_skip(p) &&
> -		    !frozen(p)) {
> -			ret = false;
> -			goto done;
> -		}
> -	}
> -done:
> +	ret = __check_frozen_processes();
>  	read_unlock(&tasklist_lock);
> -
>  	return ret;
>  }
>  
> @@ -167,15 +169,14 @@ int freeze_processes(void)
>  		 * on the way out so we have to double check for race.
>  		 */
>  		if (oom_kills_count() != oom_kills_saved &&
> -				!check_frozen_processes()) {
> +		    !check_frozen_processes()) {
>  			__usermodehelper_set_disable_depth(UMH_ENABLED);
>  			printk("OOM in progress.");
>  			error = -EBUSY;
> -			goto done;
> +		} else {
> +			printk("done.");
>  		}
> -		printk("done.");
>  	}
> -done:
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 14:29             ` Michal Hocko
@ 2014-10-22 14:39               ` Rafael J. Wysocki
  2014-10-22 14:22                 ` Michal Hocko
  2014-10-26 18:49               ` Pavel Machek
  2014-11-04 19:27               ` Tejun Heo
  2 siblings, 1 reply; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-22 14:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tuesday, October 21, 2014 04:29:39 PM Michal Hocko wrote:
> On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote:
> > On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote:
> [...]
> > > OK, incremental diff on top. I will post the complete patch if you are
> > > happier with this change
> > 
> > Yes, I am.
> ---
> From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 20 Oct 2014 18:12:32 +0200
> Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend
> 
> PM freezer relies on having all tasks frozen by the time devices are
> getting frozen so that no task will touch them while they are getting
> frozen. But OOM killer is allowed to kill an already frozen task in
> order to handle an OOM situation. In order to protect from late wake ups
> OOM killer is disabled after all tasks are frozen. This, however, still
> keeps a window open when a killed task didn't manage to die by the time
> freeze_processes finishes.
> 
> Reduce the race window by checking all tasks after OOM killer has been
> disabled. This is still not completely race free, unfortunately, because
> oom_killer_disable cannot stop an already ongoing OOM killer so a task
> might still wake up from the fridge and get killed without
> freeze_processes noticing. Full synchronization of OOM and freezer is,
> however, too heavyweight for this highly unlikely case.
> 
> Introduce and check oom_kills counter which gets incremented early when
> the allocator enters __alloc_pages_may_oom path and only check all the
> tasks if the counter changes during the freezing attempt. The counter
> is updated so early to reduce the race window since allocator checked
> oom_killer_disabled which is set by PM-freezing code. A false positive
> will push the PM-freezer into a slow path but that is not a big deal.
> 
> Changes since v1
> - push the re-check loop out of freeze_processes into
>   check_frozen_processes and invert the condition to make the code more
>   readable as per Rafael

I've applied that along with the rest of the series, but what about the
following cleanup patch on top of it?

Rafael


---
 kernel/power/process.c |   31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

Index: linux-pm/kernel/power/process.c
===================================================================
--- linux-pm.orig/kernel/power/process.c
+++ linux-pm/kernel/power/process.c
@@ -108,25 +108,27 @@ static int try_to_freeze_tasks(bool user
 	return todo ? -EBUSY : 0;
 }
 
+static bool __check_frozen_processes(void)
+{
+	struct task_struct *g, *p;
+
+	for_each_process_thread(g, p)
+		if (p != current && !freezer_should_skip(p) && !frozen(p))
+			return false;
+
+	return true;
+}
+
 /*
  * Returns true if all freezable tasks (except for current) are frozen already
  */
 static bool check_frozen_processes(void)
 {
-	struct task_struct *g, *p;
-	bool ret = true;
+	bool ret;
 
 	read_lock(&tasklist_lock);
-	for_each_process_thread(g, p) {
-		if (p != current && !freezer_should_skip(p) &&
-		    !frozen(p)) {
-			ret = false;
-			goto done;
-		}
-	}
-done:
+	ret = __check_frozen_processes();
 	read_unlock(&tasklist_lock);
-
 	return ret;
 }
 
@@ -167,15 +169,14 @@ int freeze_processes(void)
 		 * on the way out so we have to double check for race.
 		 */
 		if (oom_kills_count() != oom_kills_saved &&
-				!check_frozen_processes()) {
+		    !check_frozen_processes()) {
 			__usermodehelper_set_disable_depth(UMH_ENABLED);
 			printk("OOM in progress.");
 			error = -EBUSY;
-			goto done;
+		} else {
+			printk("done.");
 		}
-		printk("done.");
 	}
-done:
 	printk("\n");
 	BUG_ON(in_atomic());
 


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-22 14:22                 ` Michal Hocko
@ 2014-10-22 21:18                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-10-22 21:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Tejun Heo,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wednesday, October 22, 2014 04:22:26 PM Michal Hocko wrote:
> On Wed 22-10-14 16:39:12, Rafael J. Wysocki wrote:
> > On Tuesday, October 21, 2014 04:29:39 PM Michal Hocko wrote:
> > > On Tue 21-10-14 16:41:07, Rafael J. Wysocki wrote:
> > > > On Tuesday, October 21, 2014 04:11:59 PM Michal Hocko wrote:
> > > [...]
> > > > > OK, incremental diff on top. I will post the complete patch if you are
> > > > > happier with this change
> > > > 
> > > > Yes, I am.
> > > ---
> > > From 9ab46fe539cded8e7b6425b2cd23ba9184002fde Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.cz>
> > > Date: Mon, 20 Oct 2014 18:12:32 +0200
> > > Subject: [PATCH -v2] OOM, PM: OOM killed task shouldn't escape PM suspend
> > > 
> > > PM freezer relies on having all tasks frozen by the time devices are
> > > getting frozen so that no task will touch them while they are getting
> > > frozen. But OOM killer is allowed to kill an already frozen task in
> > > order to handle an OOM situation. In order to protect from late wake ups
> > > OOM killer is disabled after all tasks are frozen. This, however, still
> > > keeps a window open when a killed task didn't manage to die by the time
> > > freeze_processes finishes.
> > > 
> > > Reduce the race window by checking all tasks after OOM killer has been
> > > disabled. This is still not completely race free, unfortunately, because
> > > oom_killer_disable cannot stop an already ongoing OOM killer so a task
> > > might still wake up from the fridge and get killed without
> > > freeze_processes noticing. Full synchronization of OOM and freezer is,
> > > however, too heavyweight for this highly unlikely case.
> > > 
> > > Introduce and check oom_kills counter which gets incremented early when
> > > the allocator enters __alloc_pages_may_oom path and only check all the
> > > tasks if the counter changes during the freezing attempt. The counter
> > > is updated so early to reduce the race window since allocator checked
> > > oom_killer_disabled which is set by PM-freezing code. A false positive
> > > will push the PM-freezer into a slow path but that is not a big deal.
> > > 
> > > Changes since v1
> > > - push the re-check loop out of freeze_processes into
> > >   check_frozen_processes and invert the condition to make the code more
> > >   readable as per Rafael
> > 
> > I've applied that along with the rest of the series, but what about the
> > following cleanup patch on top of it?
> 
> Sure, looks good to me.

I'll apply it then, thanks!

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21  7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
  2014-10-21 12:09   ` Rafael J. Wysocki
@ 2014-10-26 18:40   ` Pavel Machek
  1 sibling, 0 replies; 93+ messages in thread
From: Pavel Machek @ 2014-10-26 18:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, \"Rafael J. Wysocki\",
	Cong Wang, David Rientjes, Tejun Heo, Oleg Nesterov, LKML,
	linux-mm, Linux PM list

Hi!

> +
> +		/*
> +		 * There might have been an OOM kill while we were
> +		 * freezing tasks and the killed task might be still
> +		 * on the way out so we have to double check for race.
> +		 */

", so"

>  	/*
> +	 * PM-freezer should be notified that there might be an OOM killer on its
> +	 * way to kill and wake somebody up. This is too early and we might end
> +	 * up not killing anything but false positives are acceptable.

", but".

1,2 look good to me, 

Acked-by: Pavel Machek <pavel@ucw.cz>
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 14:29             ` Michal Hocko
  2014-10-22 14:39               ` Rafael J. Wysocki
@ 2014-10-26 18:49               ` Pavel Machek
  2014-11-04 19:27               ` Tejun Heo
  2 siblings, 0 replies; 93+ messages in thread
From: Pavel Machek @ 2014-10-26 18:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Tejun Heo, Oleg Nesterov, LKML, linux-mm, Linux PM list

Hi!
>  
> +/*
> + * Number of OOM killer invocations (including memcg OOM killer).
> + * Primarily used by PM freezer to check for potential races with
> + * OOM killed frozen task.
> + */
> +static atomic_t oom_kills = ATOMIC_INIT(0);
> +
> +int oom_kills_count(void)
> +{
> +	return atomic_read(&oom_kills);
> +}
> +
> +void note_oom_kill(void)
> +{
> +	atomic_inc(&oom_kills);
> +}
> +

Do we need the extra abstraction here? Maybe oom_kills should be
exported directly?
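
Exporting the counter directly would look roughly like this (a
hypothetical sketch of that alternative, not what was merged):

	/* include/linux/oom.h: expose the raw counter */
	extern atomic_t oom_kills;

	/* mm/oom_kill.c: no accessors */
	atomic_t oom_kills = ATOMIC_INIT(0);

	/* kernel/power/process.c would then read and compare it directly */
	oom_kills_saved = atomic_read(&oom_kills);
	...
	if (atomic_read(&oom_kills) != oom_kills_saved && ...)
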
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-10-21 14:29             ` Michal Hocko
  2014-10-22 14:39               ` Rafael J. Wysocki
  2014-10-26 18:49               ` Pavel Machek
@ 2014-11-04 19:27               ` Tejun Heo
  2014-11-05 12:46                 ` Michal Hocko
  2 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-04 19:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

Hello,

Sorry about the delay.

On Tue, Oct 21, 2014 at 04:29:39PM +0200, Michal Hocko wrote:
> Reduce the race window by checking all tasks after OOM killer has been

Ugh... this is never a good direction to take.  It often just ends up
making bugs harder to reproduce and track down.

> disabled. This is still not completely race free, unfortunately, because
> oom_killer_disable cannot stop an already ongoing OOM killer so a task
> might still wake up from the fridge and get killed without
> freeze_processes noticing. Full synchronization of OOM and freezer is,
> however, too heavyweight for this highly unlikely case.

Both oom killing and PM freezing are extremely rare events and I have
a difficult time seeing why their exclusion would be heavyweight.  Care to
elaborate?

Overall, this is a lot of complexity for something which doesn't
really fix the problem, and the comments, while referring to the race,
don't mention that the implemented "fix" is broken, which is pretty
bad as it gives readers of the code a false sense of security and
another hurdle to overcome in actually tracking down what went wrong
if this thing ever shows up as an actual breakage.

I'd strongly recommend implementing something which is actually
correct.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-04 19:27               ` Tejun Heo
@ 2014-11-05 12:46                 ` Michal Hocko
  2014-11-05 13:02                   ` Tejun Heo
  2014-11-05 14:55                   ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
  0 siblings, 2 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 12:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Tue 04-11-14 14:27:05, Tejun Heo wrote:
> Hello,
> 
> Sorry about the delay.
> 
> On Tue, Oct 21, 2014 at 04:29:39PM +0200, Michal Hocko wrote:
> > Reduce the race window by checking all tasks after OOM killer has been
> 
> Ugh... this is never a good direction to take.  It often just ends up
> making bugs harder to reproduce and track down.

As I've said, I wasn't entirely happy with this half solution, but it
helped the current situation at the time. The full solution requires
fully synchronizing the OOM path with the freezer. The patch below does
that.

> > disabled. This is still not completely race free, unfortunately, because
> > oom_killer_disable cannot stop an already ongoing OOM killer so a task
> > might still wake up from the fridge and get killed without
> > freeze_processes noticing. Full synchronization of OOM and freezer is,
> > however, too heavyweight for this highly unlikely case.
> 
> Both oom killing and PM freezing are extremely rare events and I have
> a difficult time seeing why their exclusion would be heavyweight.  Care to
> elaborate?

You are right that the allocation OOM path is extremely slow, so
additional locking shouldn't matter much. I originally thought that
any locking would require more changes in the allocation path. In the
end it looks much easier than I hoped. I haven't tested it, so I might
just be missing some subtle issues.

Anyway, I cannot say I would be happy to expose a lock which can block
OOM from happening, because that calls for trouble. It is true that we
already have the ugly oom_killer_disabled hack, but that only causes
allocations to fail rather than blocking the OOM path altogether if
something goes wrong. Maybe I am just too paranoid...

So my original intention was to provide a mechanism which would be safe
from the OOM point of view and as good as possible from the PM point of
view. The race is really unlikely, and even if it happened there would
be an OOM message in the log which would give us a hint (I can add a
special note that OOM is disabled but we are killing a task regardless,
to make it more obvious, if you prefer).

> Overall, this is a lot of complexity for something which doesn't
> really fix the problem, and the comments, while referring to the race,
> don't mention that the implemented "fix" is broken, which is pretty
> bad as it gives readers of the code a false sense of security and
> another hurdle to overcome in actually tracking down what went wrong
> if this thing ever shows up as an actual breakage.

The patch description mentions that the race is not closed completely.
It is true that the comments in the code could have been more clear
about it.

> I'd strongly recommend implementing something which is actually
> correct.

I think the patch below should be safe. Would you prefer this solution
instead? It is race free but there is the risk that exposing a lock which
completely blocks the OOM killer from the allocation path will kick us
later.
---
>From ef6227565fa65b52986c4626d49ba53b499e54d1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 5 Nov 2014 11:49:14 +0100
Subject: [PATCH] OOM, PM: make OOM detection in the freezer path raceless

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window when OOM killer manages to note_oom_kill after
freeze_processes checks the counter. The race window is quite small
and really unlikely, which was deemed sufficient at the time of submission.

Tejun wasn't happy about this partial solution, though, and insisted on
a full solution. That requires full OOM and freezer exclusion, and this
is done by this patch, which introduces an oom_sem RW lock.
The page allocation OOM path takes the lock for reading because there
might be concurrent OOMs happening on disjoint zonelists. The
oom_killer_disabled check is moved right before out_of_memory is called
because it was checked too early before, and we do not want to hold the
lock while doing the last allocation attempt, which might involve
zone_reclaim.
freeze_processes then takes the lock for write throughout the whole
freezing process and OOM disabling.

There is no need to recheck all the processes with the full
synchronization anymore.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    |  5 +++++
 kernel/power/process.c | 50 +++++++++-----------------------------------------
 mm/oom_kill.c          | 17 -----------------
 mm/page_alloc.c        | 24 ++++++++++++------------
 4 files changed, 26 insertions(+), 70 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e1058723..350b9b2ffeec 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,7 +73,12 @@ extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
+/*
+ * oom_killer_disabled can be modified only under oom_sem taken for write
+ * and checked under read lock along with the full OOM handler.
+ */
 extern bool oom_killer_disabled;
+extern struct rw_semaphore oom_sem;
 
 static inline void oom_killer_disable(void)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..befce9785233 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,27 +132,20 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
+
+	/*
+	 * Need to exclude OOM killer from triggering while tasks are
+	 * getting frozen to make sure none of them gets killed after
+	 * try_to_freeze_tasks is done.
+	 */
+	down_write(&oom_sem);
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
 		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		printk("done.\n");
 	}
-	printk("\n");
+	up_write(&oom_sem);
 	BUG_ON(in_atomic());
 
 	if (error)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..bbf405a3a18f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
-/*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
- */
-static atomic_t oom_kills = ATOMIC_INIT(0);
-
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
-
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..76095266c4b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -243,6 +243,7 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 }
 
 bool oom_killer_disabled __read_mostly;
+DECLARE_RWSEM(oom_sem);
 
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
@@ -2252,14 +2253,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2288,8 +2281,17 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
-	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
+
+	/*
+	 * Exhausted what can be done so it's blamo time.
+	 * Just make sure that we cannot race with oom_killer disabling
+	 * e.g. PM freezer needs to make sure that no OOM happens after
+	 * all tasks are frozen.
+	 */
+	down_read(&oom_sem);
+	if (!oom_killer_disabled)
+		out_of_memory(zonelist, gfp_mask, order, nodemask, false);
+	up_read(&oom_sem);
 
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
@@ -2716,8 +2718,6 @@ rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
-- 
2.1.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 12:46                 ` Michal Hocko
@ 2014-11-05 13:02                   ` Tejun Heo
  2014-11-05 13:31                     ` Michal Hocko
  2014-11-05 14:55                   ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-05 13:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

Hello, Michal.

On Wed, Nov 05, 2014 at 01:46:20PM +0100, Michal Hocko wrote:
> As I've said I wasn't entirely happy with this half solution but it helped
> the current situation at the time. The full solution would require to

I don't think this helps the situation.  It just makes the bug more
obscure, and the race window, while reduced, is still pretty big;
there seems to be a real, not-too-low chance of the bug triggering
out in the wild.  How does this level of obscuring help anything?  In
addition to making the bug more difficult to reproduce, it also adds a
bunch of code which *pretends* to address the issue but ultimately
just lowers visibility into what's going on and hinders tracking down
the issue when something actually goes wrong.  This is *NOT* making
the situation better.  The patch is net negative.

> I think the patch below should be safe. Would you prefer this solution
> instead? It is race free but there is the risk that exposing a lock which

Yes, this is a lot saner approach in general.

> completely blocks the OOM killer from the allocation path will kick us
> later.

Can you please spell it out?  How would it kick us?  We already have
oom_killer_disable/enable(), how is this any different in terms of
correctness from them?  Also, why isn't this part of
oom_killer_disable/enable()?  The way they're implemented is really
silly now.  It just sets a flag and returns whether there's a
currently running instance or not.  How were these even useful?  Why
can't you just make disable/enable do what they were supposed to do
from the beginning?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 13:02                   ` Tejun Heo
@ 2014-11-05 13:31                     ` Michal Hocko
  2014-11-05 13:42                       ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 13:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 08:02:47, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Nov 05, 2014 at 01:46:20PM +0100, Michal Hocko wrote:
> > As I've said I wasn't entirely happy with this half solution but it helped
> > the current situation at the time. The full solution would require to
> 
> I don't think this helps the situation.  It just makes the bug more
> obscure, and the race window, while reduced, is still pretty big;
> there seems to be a real, not-too-low chance of the bug triggering
> out in the wild.  How does this level of obscuring help anything?  In
> addition to making the bug more difficult to reproduce, it also adds a
> bunch of code which *pretends* to address the issue but ultimately
> just lowers visibility into what's going on and hinders tracking down
> the issue when something actually goes wrong.  This is *NOT* making
> the situation better.  The patch is net negative.

The patch was a compromise. It was needed to catch the most common
OOM conditions while the tasks are getting frozen. The race window
between the counter increment and the check in the PM path is negligible
compared to the freezing process. And it is safe from the OOM point of
view because nothing can block it.

> > I think the patch below should be safe. Would you prefer this solution
> > instead? It is race free but there is the risk that exposing a lock which
> 
> Yes, this is a lot saner approach in general.
> 
> > completely blocks the OOM killer from the allocation path will kick us
> > later.
> 
> Can you please spell it out?  How would it kick us?  We already have
> oom_killer_disable/enable(), how is this any different in terms of
> correctness from them? 

As already said in the part of the email you haven't quoted:
oom_killer_disable will cause allocations to _fail_. With the lock you
are _blocking_ the OOM killer completely. This is error prone because no
part of the system should be able to block the last-resort memory
shortage actions.
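
The difference in allocator terms, as a condensed sketch of the two
variants discussed here (not new code):

	/* flag variant: the allocation simply fails */
	if (oom_killer_disabled)
		goto nopage;

	/* lock variant: down_read() sleeps for as long as the freezer
	 * holds oom_sem for write, i.e. the whole OOM path is blocked */
	down_read(&oom_sem);
	if (!oom_killer_disabled)
		out_of_memory(zonelist, gfp_mask, order, nodemask, false);
	up_read(&oom_sem);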

> Also, why isn't this part of
> oom_killer_disable/enable()?  The way they're implemented is really
> silly now.  It just sets a flag and returns whether there's a
> currently running instance or not.  How were these even useful? 
> Why can't you just make disable/enable do what they were supposed to
> do from the beginning?

Because then we would block all the potential allocators coming from
workqueues or kernel threads which are not frozen yet rather than fail
the allocation. I am not familiar enough with the PM code and all the
paths this might get called from to tell whether failing the allocation
is a better approach than failing the suspend operation on a timeout.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 13:31                     ` Michal Hocko
@ 2014-11-05 13:42                       ` Michal Hocko
  2014-11-05 14:14                         ` Michal Hocko
  2014-11-05 15:44                         ` Tejun Heo
  0 siblings, 2 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 13:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 14:31:00, Michal Hocko wrote:
> On Wed 05-11-14 08:02:47, Tejun Heo wrote:
[...]
> > Also, why isn't this part of
> > oom_killer_disable/enable()?  The way they're implemented is really
> > silly now.  It just sets a flag and returns whether there's a
> > currently running instance or not.  How were these even useful? 
> > Why can't you just make disable/enable do what they were supposed to
> > do from the beginning?
> 
> Because then we would block all the potential allocators coming from
> workqueues or kernel threads which are not frozen yet rather than fail
> the allocation.

After thinking about this more it would be doable by using trylock in
the allocation oom path. I will respin the patch. The API will be
cleaner this way.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 13:42                       ` Michal Hocko
@ 2014-11-05 14:14                         ` Michal Hocko
  2014-11-05 15:45                           ` Michal Hocko
  2014-11-05 15:44                         ` Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 14:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 14:42:19, Michal Hocko wrote:
> On Wed 05-11-14 14:31:00, Michal Hocko wrote:
> > On Wed 05-11-14 08:02:47, Tejun Heo wrote:
> [...]
> > > Also, why isn't this part of
> > > oom_killer_disable/enable()?  The way they're implemented is really
> > > silly now.  It just sets a flag and returns whether there's a
> > > currently running instance or not.  How were these even useful? 
> > > Why can't you just make disable/enable do what they were supposed to
> > > do from the beginning?
> > 
> > Because then we would block all the potential allocators coming from
> > workqueues or kernel threads which are not frozen yet rather than fail
> > the allocation.
> 
> After thinking about this more it would be doable by using trylock in
> the allocation oom path. I will respin the patch. The API will be
> cleaner this way.
---
>From 33654faeea161ef9a411f9ff6d84419712bb4a0f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 5 Nov 2014 15:09:56 +0100
Subject: [PATCH] OOM, PM: make OOM detection in the freezer path raceless

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window when OOM killer manages to note_oom_kill after
freeze_processes checks the counter. The race window is quite small
and really unlikely, which was deemed sufficient at the time of submission.

Tejun wasn't happy about this partial solution, though, and insisted on
a full solution. That requires full OOM and freezer exclusion, and this
is done by this patch, which introduces an oom_sem RW lock and gets rid
of the oom_killer_disabled global flag.

The PM code uses oom_killer_{disable,enable}, which take the lock for
write and exclude all OOM killer invocations from the page allocation
path.

The allocation path uses oom_killer_allowed_{start,end} around the
__alloc_pages_may_oom call. This is implemented by a read trylock so all
the concurrent OOM killers (operating on different zonelists) are allowed
to proceed unless OOM is disabled, in which case the allocation simply
fails.

There is no need to recheck all the processes with the full
synchronization anymore.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    | 33 ++++++++++++++++++++++++---------
 kernel/power/process.c | 50 ++++++++------------------------------------------
 mm/oom_kill.c          | 39 ++++++++++++++++++++++-----------------
 mm/page_alloc.c        | 21 +++++++++------------
 4 files changed, 63 insertions(+), 80 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e1058723..850f7f653eb7 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -73,17 +73,32 @@ extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disabled;
+/**
+ * oom_killer_disable - disable OOM killer in page allocator
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ */
+extern void oom_killer_disable(void);
 
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+extern void oom_killer_enable(void);
 
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+/**
+ * oom_killer_allowed_start - start OOM killer section
+ *
+ * Synchronise with oom_killer_{disable,enable} sections.
+ * Returns 1 if oom_killer is allowed.
+ */
+extern int oom_killer_allowed_start(void);
+
+/**
+ * oom_killer_allowed_end - end OOM killer section
+ *
+ * previously started by oom_killer_allowed_start.
+ */
+extern void oom_killer_allowed_end(void);
 
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..7d08d56cbf3f 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,27 +132,18 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
+
+	/*
+	 * Need to exclude OOM killer from triggering while tasks are
+	 * getting frozen to make sure none of them gets killed after
+	 * try_to_freeze_tasks is done.
+	 */
+	oom_killer_disable();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		printk("done.\n");
 	}
-	printk("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..7fc75b4df837 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
-/*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
- */
-static atomic_t oom_kills = ATOMIC_INIT(0);
-
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
-
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -615,6 +598,28 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+static DECLARE_RWSEM(oom_sem);
+
+void oom_killer_disabled(void)
+{
+	down_write(&oom_sem);
+}
+
+void oom_killer_enable(void)
+{
+	up_write(&oom_sem);
+}
+
+int oom_killer_allowed_start(void)
+{
+	return down_read_trylock(&oom_sem);
+}
+
+void oom_killer_allowed_end(void)
+{
+	up_read(&oom_sem);
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..206ce46ce975 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2252,14 +2250,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2716,16 +2706,23 @@ rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
 				goto nopage;
+			/*
+			 * Just make sure that we cannot race with oom_killer
+			 * disabling e.g. PM freezer needs to make sure that
+			 * no OOM happens after all tasks are frozen.
+			 */
+			if (!oom_killer_allowed_start())
+				goto nopage;
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
 					classzone_idx, migratetype);
+			oom_killer_allowed_end();
+
 			if (page)
 				goto got_pg;
 
-- 
2.1.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 12:46                 ` Michal Hocko
  2014-11-05 13:02                   ` Tejun Heo
@ 2014-11-05 14:55                   ` Michal Hocko
  1 sibling, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 14:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 13:46:20, Michal Hocko wrote:
[...]
> From ef6227565fa65b52986c4626d49ba53b499e54d1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 5 Nov 2014 11:49:14 +0100
> Subject: [PATCH] OOM, PM: make OOM detection in the freezer path raceless
> 
> 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
> has left a race window when OOM killer manages to note_oom_kill after
> freeze_processes checks the counter. The race window is quite small
> and really unlikely, which was deemed sufficient at the time of submission.
> 
> Tejun wasn't happy about this partial solution, though, and insisted on
> a full solution. That requires full OOM and freezer exclusion, and this
> is done by this patch, which introduces an oom_sem RW lock.
> The page allocation OOM path takes the lock for reading because there
> might be concurrent OOMs happening on disjoint zonelists. The
> oom_killer_disabled check is moved right before out_of_memory is called
> because it was checked too early before, and we do not want to hold the
> lock while doing the last allocation attempt, which might involve
> zone_reclaim.

This is incorrect: it would cause an endless allocation loop because we
really have to go to nopage if OOM is disabled.
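
In other words, with the hunk above applied the slowpath would spin;
a very simplified sketch of the retry logic in __alloc_pages_slowpath,
assuming reclaim keeps making no progress while OOM is disabled:

	rebalance:
		did_some_progress = try_to_free_pages(...);	/* 0 */
		if (!did_some_progress && oom_gfp_allowed(gfp_mask)) {
			/* out_of_memory() is silently skipped while OOM
			 * is disabled, so no page is returned, and
			 * without a "goto nopage" here we retry forever */
			page = __alloc_pages_may_oom(...);
			if (!page)
				goto rebalance;
		}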

> freeze_processes then takes the lock for write throughout the whole
> freezing process and OOM disabling.
> 
> There is no need to recheck all the processes with the full
> synchronization anymore.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/oom.h    |  5 +++++
>  kernel/power/process.c | 50 +++++++++-----------------------------------------
>  mm/oom_kill.c          | 17 -----------------
>  mm/page_alloc.c        | 24 ++++++++++++------------
>  4 files changed, 26 insertions(+), 70 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index e8d6e1058723..350b9b2ffeec 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -73,7 +73,12 @@ extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
>  
> +/*
> + * oom_killer_disabled can be modified only under oom_sem taken for write
> + * and checked under read lock along with the full OOM handler.
> + */
>  extern bool oom_killer_disabled;
> +extern struct rw_semaphore oom_sem;
>  
>  static inline void oom_killer_disable(void)
>  {
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 5a6ec8678b9a..befce9785233 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
>  	return todo ? -EBUSY : 0;
>  }
>  
> -static bool __check_frozen_processes(void)
> -{
> -	struct task_struct *g, *p;
> -
> -	for_each_process_thread(g, p)
> -		if (p != current && !freezer_should_skip(p) && !frozen(p))
> -			return false;
> -
> -	return true;
> -}
> -
> -/*
> - * Returns true if all freezable tasks (except for current) are frozen already
> - */
> -static bool check_frozen_processes(void)
> -{
> -	bool ret;
> -
> -	read_lock(&tasklist_lock);
> -	ret = __check_frozen_processes();
> -	read_unlock(&tasklist_lock);
> -	return ret;
> -}
> -
>  /**
>   * freeze_processes - Signal user space processes to enter the refrigerator.
>   * The current thread will not be frozen.  The same process that calls
> @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
>  int freeze_processes(void)
>  {
>  	int error;
> -	int oom_kills_saved;
>  
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -157,27 +132,20 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
> +
> +	/*
> +	 * Need to exclude OOM killer from triggering while tasks are
> +	 * getting frozen to make sure none of them gets killed after
> +	 * try_to_freeze_tasks is done.
> +	 */
> +	down_write(&oom_sem);
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
>  		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");
>  	}
> -	printk("\n");
> +	up_write(&oom_sem);
>  	BUG_ON(in_atomic());
>  
>  	if (error)
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5340f6b91312..bbf405a3a18f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  		dump_tasks(memcg, nodemask);
>  }
>  
> -/*
> - * Number of OOM killer invocations (including memcg OOM killer).
> - * Primarily used by PM freezer to check for potential races with
> - * OOM killed frozen task.
> - */
> -static atomic_t oom_kills = ATOMIC_INIT(0);
> -
> -int oom_kills_count(void)
> -{
> -	return atomic_read(&oom_kills);
> -}
> -
> -void note_oom_kill(void)
> -{
> -	atomic_inc(&oom_kills);
> -}
> -
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  /*
>   * Must be called while holding a reference to p, which will be released upon
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9cd36b822444..76095266c4b5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -243,6 +243,7 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
>  }
>  
>  bool oom_killer_disabled __read_mostly;
> +DECLARE_RWSEM(oom_sem);
>  
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
> @@ -2252,14 +2253,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/*
> -	 * PM-freezer should be notified that there might be an OOM killer on
> -	 * its way to kill and wake somebody up. This is too early and we might
> -	 * end up not killing anything but false positives are acceptable.
> -	 * See freeze_processes.
> -	 */
> -	note_oom_kill();
> -
> -	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
>  	 * we're still under heavy pressure.
> @@ -2288,8 +2281,17 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> -	/* Exhausted what can be done so it's blamo time */
> -	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> +
> +	/*
> +	 * Exhausted what can be done so it's blamo time.
> +	 * Just make sure that we cannot race with oom_killer disabling
> +	 * e.g. PM freezer needs to make sure that no OOM happens after
> +	 * all tasks are frozen.
> +	 */
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled)
> +		out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> +	up_read(&oom_sem);
>  
>  out:
>  	oom_zonelist_unlock(zonelist, gfp_mask);
> @@ -2716,8 +2718,6 @@ rebalance:
>  	 */
>  	if (!did_some_progress) {
>  		if (oom_gfp_allowed(gfp_mask)) {
> -			if (oom_killer_disabled)
> -				goto nopage;
>  			/* Coredumps can quickly deplete all memory reserves */
>  			if ((current->flags & PF_DUMPCORE) &&
>  			    !(gfp_mask & __GFP_NOFAIL))
> -- 
> 2.1.1
> 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 13:42                       ` Michal Hocko
  2014-11-05 14:14                         ` Michal Hocko
@ 2014-11-05 15:44                         ` Tejun Heo
  2014-11-05 16:01                           ` Michal Hocko
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-05 15:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed, Nov 05, 2014 at 02:42:19PM +0100, Michal Hocko wrote:
> On Wed 05-11-14 14:31:00, Michal Hocko wrote:
> > On Wed 05-11-14 08:02:47, Tejun Heo wrote:
> [...]
> > > Also, why isn't this part of
> > > oom_killer_disable/enable()?  The way they're implemented is really
> > > silly now.  It just sets a flag and returns whether there's a
> > > currently running instance or not.  How were these even useful? 
> > > Why can't you just make disable/enable do what they were supposed to
> > > do from the beginning?
> > 
> > Because then we would block all the potential allocators coming from
> > workqueues or kernel threads which are not frozen yet rather than fail
> > the allocation.
> 
> After thinking about this more it would be doable by using trylock in
> the allocation oom path. I will respin the patch. The API will be
> cleaner this way.

In disable, block new invocations of the OOM killer and then drain the
in-progress ones.  This is a common pattern, isn't it?
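
Spelled out, the pattern is: set a flag so that new invocations bail
out, then wait until the count of in-flight invocations drops to zero.
A generic sketch with hypothetical names (memory barriers elided):

	static bool oom_disabled;
	static atomic_t oom_in_flight = ATOMIC_INIT(0);
	static DECLARE_WAIT_QUEUE_HEAD(oom_drain_wq);

	bool oom_invoke(void)
	{
		atomic_inc(&oom_in_flight);
		if (oom_disabled) {	/* re-check after the inc */
			if (atomic_dec_and_test(&oom_in_flight))
				wake_up(&oom_drain_wq);
			return false;
		}
		/* ... pick and kill a victim ... */
		if (atomic_dec_and_test(&oom_in_flight))
			wake_up(&oom_drain_wq);
		return true;
	}

	void oom_disable_and_drain(void)
	{
		oom_disabled = true;
		wait_event(oom_drain_wq, !atomic_read(&oom_in_flight));
	}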

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 14:14                         ` Michal Hocko
@ 2014-11-05 15:45                           ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 15:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

Oops, just noticed that I have a compile fix staged which didn't make it
into git format-patch. I will repost once/if you are OK with this
approach. But I guess this is a much better outcome. Thanks for pushing,
Tejun!

On Wed 05-11-14 15:14:58, Michal Hocko wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5340f6b91312..7fc75b4df837 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
[...]
> @@ -615,6 +598,28 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
>  	spin_unlock(&zone_scan_lock);
>  }
>  
> +static DECLARE_RWSEM(oom_sem);
> +
> +void oom_killer_disabled(void)

Should be oom_killer_disable(void)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 15:44                         ` Tejun Heo
@ 2014-11-05 16:01                           ` Michal Hocko
  2014-11-05 16:29                             ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 16:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 10:44:36, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 02:42:19PM +0100, Michal Hocko wrote:
> > On Wed 05-11-14 14:31:00, Michal Hocko wrote:
> > > On Wed 05-11-14 08:02:47, Tejun Heo wrote:
> > [...]
> > > > Also, why isn't this part of
> > > > oom_killer_disable/enable()?  The way they're implemented is really
> > > > silly now.  It just sets a flag and returns whether there's a
> > > > currently running instance or not.  How were these even useful? 
> > > > Why can't you just make disable/enable do what they were supposed to
> > > > do from the beginning?
> > > 
> > > Because then we would block all the potential allocators coming from
> > > workqueues or kernel threads which are not frozen yet rather than fail
> > > the allocation.
> > 
> > After thinking about this more it would be doable by using trylock in
> > the allocation oom path. I will respin the patch. The API will be
> > cleaner this way.
> 
> In disable, block new invocations of OOM killer and then drain the
> in-progress ones.  This is a common pattern, isn't it?

I am not sure I am following. With the latest patch the OOM path is no
longer blocked by the PM code (aka oom_killer_disable()). Allocations
simply fail if the read_trylock fails.
oom_killer_disable is moved before tasks are frozen and it will wait for
all on-going OOM killers on the write lock. The OOM killer is enabled
again on the resume path.
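
So the suspend-side ordering is simply (a condensed sketch of the patch
above):

	oom_killer_disable();	/* down_write(&oom_sem): waits for any
				 * in-flight out_of_memory() to finish */
	error = try_to_freeze_tasks(true);	/* nothing can be OOM
						 * killed from here on */
	...
	/* and later, on the resume path */
	oom_killer_enable();	/* up_write(&oom_sem) */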

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 16:01                           ` Michal Hocko
@ 2014-11-05 16:29                             ` Tejun Heo
  2014-11-05 16:39                               ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-05 16:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

Hello, Michal.

On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> I am not sure I am following. With the latest patch the OOM path is no
> longer blocked by the PM code (aka oom_killer_disable()). Allocations
> simply fail if the read_trylock fails.
> oom_killer_disable is moved before tasks are frozen and it will wait for
> all on-going OOM killers on the write lock. The OOM killer is enabled
> again on the resume path.

Sure, but why are we exposing new interfaces?  Can't we just make
oom_killer_disable() first set the disable flag and wait for the
on-going ones to finish (and make the function fail if it gets chosen
as an OOM victim)?  It's weird to expose extra stuff on top.  Why are
we doing that?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 16:29                             ` Tejun Heo
@ 2014-11-05 16:39                               ` Michal Hocko
  2014-11-05 16:54                                 ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 16:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 11:29:29, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> > I am not sure I am following. With the latest patch OOM path is no
> > longer blocked by the PM (aka oom_killer_disable()). Allocations simply
> > fail if the read_trylock fails.
> > oom_killer_disable is moved before tasks are frozen and it will wait for
> > all on-going OOM killers on the write lock. OOM killer is enabled again
> > on the resume path.
> 
> Sure, but why are we exposing new interfaces?  Can't we just make
> oom_killer_disable() first set the disable flag and wait for the
> on-going ones to finish (and make the function fail if it gets chosen
> as an OOM victim)?

Still not following. How do you want to detect an on-going OOM without
any interface around out_of_memory?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 16:39                               ` Michal Hocko
@ 2014-11-05 16:54                                 ` Tejun Heo
  2014-11-05 17:01                                   ` Tejun Heo
  2014-11-05 17:46                                   ` Michal Hocko
  0 siblings, 2 replies; 93+ messages in thread
From: Tejun Heo @ 2014-11-05 16:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed, Nov 05, 2014 at 05:39:56PM +0100, Michal Hocko wrote:
> On Wed 05-11-14 11:29:29, Tejun Heo wrote:
> > Hello, Michal.
> > 
> > On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> > > I am not sure I am following. With the latest patch OOM path is no
> > > longer blocked by the PM (aka oom_killer_disable()). Allocations simply
> > > fail if the read_trylock fails.
> > > oom_killer_disable is moved before tasks are frozen and it will wait for
> > > all on-going OOM killers on the write lock. OOM killer is enabled again
> > > on the resume path.
> > 
> > Sure, but why are we exposing new interfaces?  Can't we just make
> > oom_killer_disable() first set the disable flag and wait for the
> > on-going ones to finish (and make the function fail if it gets chosen
> > as an OOM victim)?
> 
> Still not following. How do you want to detect an on-going OOM without
> any interface around out_of_memory?

I thought you were using oom_killer_allowed_start() outside OOM path.
Ugh.... why is everything weirdly structured?  oom_killer_disabled
implies that oom killer may fail, right?  Why is
__alloc_pages_slowpath() checking it directly?  If whether oom killing
failed or not is relevant to its users, make out_of_memory() return an
error code.  There's no reason for the exclusion detail to leak out of
the oom killer proper.  The only interface should be disable/enable
and whether oom killing failed or not.
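
IOW something like this (just sketching the interface shape, not code):

	void oom_killer_disable(void);	/* block new OOM invocations */
	void oom_killer_enable(void);
	/* returns false when the killer is disabled so that callers can
	 * fail the allocation without knowing the locking details */
	bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
			   int order, nodemask_t *mask, bool force_kill);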

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 16:54                                 ` Tejun Heo
@ 2014-11-05 17:01                                   ` Tejun Heo
  2014-11-06 13:05                                     ` Michal Hocko
  2014-11-05 17:46                                   ` Michal Hocko
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-05 17:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed, Nov 05, 2014 at 11:54:28AM -0500, Tejun Heo wrote:
> > Still not following. How do you want to detect an on-going OOM without
> > any interface around out_of_memory?
> 
> I thought you were using oom_killer_allowed_start() outside OOM path.
> Ugh.... why is everything weirdly structured?  oom_killer_disabled
> implies that oom killer may fail, right?  Why is
> __alloc_pages_slowpath() checking it directly?  If whether oom killing
> failed or not is relevant to its users, make out_of_memory() return an
> error code.  There's no reason for the exclusion detail to leak out of
> the oom killer proper.  The only interface should be disable/enable
> and whether oom killing failed or not.

And what's implemented is wrong.  What happens if oom killing is
already in progress and then a task blocks trying to write-lock the
rwsem and then that task is selected as the OOM victim?  disable()
call must be able to fail.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 16:54                                 ` Tejun Heo
  2014-11-05 17:01                                   ` Tejun Heo
@ 2014-11-05 17:46                                   ` Michal Hocko
  2014-11-05 17:55                                     ` Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-05 17:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 11:54:28, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 05:39:56PM +0100, Michal Hocko wrote:
> > On Wed 05-11-14 11:29:29, Tejun Heo wrote:
> > > Hello, Michal.
> > > 
> > > On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> > > > I am not sure I am following. With the latest patch OOM path is no
> > > > longer blocked by the PM (aka oom_killer_disable()). Allocations simply
> > > > fail if the read_trylock fails.
> > > > oom_killer_disable is moved before tasks are frozen and it will wait for
> > > > all on-going OOM killers on the write lock. OOM killer is enabled again
> > > > on the resume path.
> > > 
> > > Sure, but why are we exposing new interfaces?  Can't we just make
> > > oom_killer_disable() first set the disable flag and wait for the
> > > on-going ones to finish (and make the function fail if it gets chosen
> > > as an OOM victim)?
> > 
> > Still not following. How do you want to detect an on-going OOM without
> > any interface around out_of_memory?
> 
> I thought you were using oom_killer_allowed_start() outside OOM path.
> Ugh.... why is everything weirdly structured?  oom_killer_disabled
> implies that oom killer may fail, right?  Why is
> __alloc_pages_slowpath() checking it directly?

Because out_of_memory can be called from multiple paths. And
the only interesting one should be the page allocation path.
pagefault_out_of_memory is not interesting because it cannot happen for
the frozen task.

Now that I am looking at it, maybe even the sysrq OOM trigger should be
covered as well.

> If whether oom killing failed or not is relevant to its users, make
> out_of_memory() return an error code.  There's no reason for the
> exclusion detail to leak out of the oom killer proper.  The only
> interface should be disable/enable and whether oom killing failed or
> not.

Got your point. I can reshuffle the code and make the trylock thingy
inside oom_kill.c. I am not sure it is so much better because the OOM
knowledge is already spread (e.g. check oom_zonelist_trylock outside of
out_of_memory or even oom_gfp_allowed before we
enter __alloc_pages_may_oom). Anyway, I do not care much and I am OK with
your return code convention as the only other way OOM might fail is
when there is no victim and we panic then.

Something like (even not compile tested)
---
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..14f3d7fd961f 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM killer disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 850f7f653eb7..4af99a9b543b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -68,7 +68,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
@@ -85,21 +85,6 @@ extern void oom_killer_disable(void);
  */
 extern void oom_killer_enable(void);
 
-/**
- * oom_killer_allowed_start - start OOM killer section
- *
- * Synchronise with oom_killer_{disable,enable} sections.
- * Returns 1 if oom_killer is allowed.
- */
-extern int oom_killer_allowed_start(void);
-
-/**
- * oom_killer_allowed_end - end OOM killer section
- *
- * previously started by oom_killer_allowed_end.
- */
-extern void oom_killer_allowed_end(void);
-
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
 	return (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 126e7da17cf9..3e136a2c0b1f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -610,18 +610,8 @@ void oom_killer_enable(void)
 	up_write(&oom_sem);
 }
 
-int oom_killer_allowed_start(void)
-{
-	return down_read_trylock(&oom_sem);
-}
-
-void oom_killer_allowed_end(void)
-{
-	up_read(&oom_sem);
-}
-
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -633,7 +623,7 @@ void oom_killer_allowed_end(void)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -698,6 +688,27 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory - tries to invoke the OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * Invokes __out_of_memory and returns true unless the OOM killer has been
+ * disabled by oom_killer_disable(), in which case it returns false.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	if (!down_read_trylock(&oom_sem))
+		return false;
+	__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+	up_read(&oom_sem);
+
+	return true;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -712,7 +723,7 @@ void pagefault_out_of_memory(void)
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		__out_of_memory(NULL, 0, 0, NULL, false);
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 206ce46ce975..fdbcdd9cd1a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2239,10 +2239,11 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2279,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2706,26 +2707,28 @@ rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
 				goto nopage;
-			/*
-			 * Just make sure that we cannot race with oom_killer
-			 * disabling e.g. PM freezer needs to make sure that
-			 * no OOM happens after all tasks are frozen.
-			 */
-			if (!oom_killer_allowed_start())
-				goto nopage;
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
-			oom_killer_allowed_end();
+					classzone_idx, migratetype,
+					&oom_failed);
 
 			if (page)
 				goto got_pg;
 
+			/*
+			 * OOM killer might be disabled and then we have to
+			 * fail the allocation
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 17:46                                   ` Michal Hocko
@ 2014-11-05 17:55                                     ` Tejun Heo
  2014-11-06 12:49                                       ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-05 17:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote:
> Because out_of_memory can be called from multiple paths. And
> the only interesting one should be the page allocation path.
> pagefault_out_of_memory is not interesting because it cannot happen for
> the frozen task.

Hmmm.... wouldn't that be broken by definition tho?  So, if the oom
killer is invoked from somewhere else than page allocation path, it
would proceed ignoring the disabled setting and would race against PM
freeze path all the same.  Why are things broken at such basic levels?
Something named oom_killer_disable does a lame attempt at it and not
even that depending on who's calling.  There probably is a history
leading to the current situation but the level that things are broken
at is too basic and baffling.  :(

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 17:55                                     ` Tejun Heo
@ 2014-11-06 12:49                                       ` Michal Hocko
  2014-11-06 15:01                                         ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-06 12:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 12:55:27, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote:
> > Because out_of_memory can be called from multiple paths. And
> > the only interesting one should be the page allocation path.
> > pagefault_out_of_memory is not interesting because it cannot happen for
> > the frozen task.
> 
> Hmmm.... wouldn't that be broken by definition tho?  So, if the oom
> killer is invoked from somewhere else than page allocation path, it
> would proceed ignoring the disabled setting and would race against PM
> freeze path all the same. 

Not really because try_to_freeze_tasks doesn't finish until _all_ tasks
are frozen and a task in the page fault path cannot be frozen, can it?

I mean there shouldn't be any problem to not invoke OOM killer under
from the page fault path as well but that might lead to looping in the
page fault path without any progress until freezer enables OOM killer on
the failure path because the said task cannot be frozen.

Is this preferable?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-05 17:01                                   ` Tejun Heo
@ 2014-11-06 13:05                                     ` Michal Hocko
  2014-11-06 15:09                                       ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-06 13:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Wed 05-11-14 12:01:11, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 11:54:28AM -0500, Tejun Heo wrote:
> > > Still not following. How do you want to detect an on-going OOM without
> > > any interface around out_of_memory?
> > 
> > I thought you were using oom_killer_allowed_start() outside OOM path.
> > Ugh.... why is everything weirdly structured?  oom_killer_disabled
> > implies that oom killer may fail, right?  Why is
> > __alloc_pages_slowpath() checking it directly?  If whether oom killing
> > failed or not is relevant to its users, make out_of_memory() return an
> > error code.  There's no reason for the exclusion detail to leak out of
> > the oom killer proper.  The only interface should be disable/enable
> > and whether oom killing failed or not.
> 
> And what's implemented is wrong.  What happens if oom killing is
> already in progress and then a task blocks trying to write-lock the
> rwsem and then that task is selected as the OOM victim?

But this is nothing new. Suspend hasn't been checking for fatal signals
nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose
even before.

This is not harmful though. The previous OOM kill attempt would kick the
current task, mark it with TIF_MEMDIE and retry the allocation. After
OOM is disabled the allocation simply fails. The current will die on its
way out of the kernel. Definitely worth fixing. In a separate patch.

> disable() call must be able to fail.

This would be a way to do it without requiring caller to check for
TIF_MEMDIE explicitly. The fewer of them we have the better.
---
>From 3a7e18144a369bfc537c1cda4c7c2c548e9114b8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 6 Nov 2014 11:51:34 +0100
Subject: [PATCH] OOM, PM: handle pm freezer as an OOM victim correctly

The PM freezer doesn't check whether it has been killed by the OOM
killer after it disables the OOM killer, which means that it continues
with the suspend even though it should die as soon as possible. This
has been the case ever since PM suspend started disabling the OOM
killer and I suppose it ignored OOM even before that.

This is not harmful though. The allocation which triggered the OOM will
be retried after a process is killed and the next attempt will fail
because the OOM killer will be disabled by that time, so there is no
risk of an endless loop even though the OOM victim doesn't die.

But this is a correctness issue because no task should ignore OOM.
As suggested by Tejun, oom_killer_disable will return a success status
now. If the current task has fatal signals pending or TIF_MEMDIE is set
after oom_sem is taken then the caller should bail out and this is what
freeze_processes does with this patch.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    |  4 +++-
 kernel/power/process.c | 16 ++++++++++------
 mm/oom_kill.c          | 12 +++++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4af99a9b543b..a978bf2b02a1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,8 +77,10 @@ extern int unregister_oom_notifier(struct notifier_block *nb);
  * oom_killer_disable - disable OOM killer in page allocator
  *
  * Forces all page allocations to fail rather than trigger OOM killer.
+ * Returns true on success and false if the OOM killer couldn't be
+ * disabled (e.g. because the current task has been killed meanwhile)
  */
-extern void oom_killer_disable(void);
+extern bool oom_killer_disable(void);
 
 /**
  * oom_killer_enable - enable OOM killer
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 7d08d56cbf3f..0f8b782f9215 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -123,6 +123,16 @@ int freeze_processes(void)
 	if (error)
 		return error;
 
+	/*
+	 * Need to exclude OOM killer from triggering while tasks are
+	 * getting frozen to make sure none of them gets killed after
+	 * try_to_freeze_tasks is done.
+	 */
+	if (!oom_killer_disable()) {
+		usermodehelper_enable();
+		return -EBUSY;
+	}
+
 	/* Make sure this task doesn't get frozen */
 	current->flags |= PF_SUSPEND_TASK;
 
@@ -133,12 +143,6 @@ int freeze_processes(void)
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
 
-	/*
-	 * Need to exclude OOM killer from triggering while tasks are
-	 * getting frozen to make sure none of them gets killed after
-	 * try_to_freeze_tasks is done.
-	 */
-	oom_killer_disable();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f80c5b777f05..58ade54ee421 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -600,9 +600,19 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 
 static DECLARE_RWSEM(oom_sem);
 
-void oom_killer_disable(void)
+bool oom_killer_disable(void)
 {
+	bool ret = true;
+
 	down_write(&oom_sem);
+
+	/* We might have been killed while waiting for the oom_sem. */
+	if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		ret = false;
+	}
+
+	return ret;
 }
 
 void oom_killer_enable(void)
-- 
2.1.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 12:49                                       ` Michal Hocko
@ 2014-11-06 15:01                                         ` Tejun Heo
  2014-11-06 16:02                                           ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-06 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu, Nov 06, 2014 at 01:49:53PM +0100, Michal Hocko wrote:
> On Wed 05-11-14 12:55:27, Tejun Heo wrote:
> > On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote:
> > > Because out_of_memory can be called from multiple paths. And
> > > the only interesting one should be the page allocation path.
> > > pagefault_out_of_memory is not interesting because it cannot happen for
> > > the frozen task.
> > 
> > Hmmm.... wouldn't that be broken by definition tho?  So, if the oom
> > killer is invoked from somewhere else than page allocation path, it
> > would proceed ignoring the disabled setting and would race against PM
> > freeze path all the same. 
> 
> Not really because try_to_freeze_tasks doesn't finish until _all_ tasks
> are frozen and a task in the page fault path cannot be frozen, can it?

We used to have freezing points deep in file system code which may be
reachable from page fault.  Please take a step back and look at the
paragraph above.  Doesn't it sound extremely contrived and brittle
even if it's not outright broken?  What if somebody adds another oom
killing site somewhere else?  How can this possibly be a solution that
we intentionally implement?

> I mean there shouldn't be any problem with not invoking the OOM killer
> from the page fault path as well but that might lead to looping in the
> page fault path without any progress until the freezer re-enables the OOM
> killer on its failure path because the said task cannot be frozen.
> 
> Is this preferable?

Why would PM freezing make OOM killing fail?  That doesn't make much
sense.  Sure, it can block it for a finite duration for sync purposes
but making OOM killing fail seems the wrong way around.  We're doing
one thing for non-PM freezing and the other way around for PM
freezing, which indicates one of the two directions is wrong.

Shouldn't it be that OOM killing happening while PM freezing is in
progress cancels PM freezing rather than the other way around?  Find a
point in PM suspend/hibernation operation where everything must be
stable, disable OOM killing there and check whether OOM killing
happened inbetween and if so back out.  It seems rather obvious to me
that OOM killing has to have precedence over PM freezing.

Sure, once the system reaches a point where the whole system must be
in a stable state for snapshotting or whatever, disabling OOM killing
is fine but at that point the system is in a very limited execution
mode and sure won't be processing page faults from userland for
example and we can actually disable OOM killing knowing that anything
afterwards is ready to handle memory allocation failures.
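
Roughly the following ordering (just a sketch; oom_killed_meanwhile()
is a placeholder for whatever check ends up being used):

	error = try_to_freeze_tasks(true);	/* quiesce the system first */
	oom_killer_disable();			/* nothing new gets killed now */
	if (oom_killed_meanwhile()) {		/* did OOM race with freezing? */
		oom_killer_enable();
		error = -EBUSY;			/* back out and thaw */
	}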

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 13:05                                     ` Michal Hocko
@ 2014-11-06 15:09                                       ` Tejun Heo
  2014-11-06 16:01                                         ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-06 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu, Nov 06, 2014 at 02:05:43PM +0100, Michal Hocko wrote:
> But this is nothing new. Suspend hasn't been checking for fatal signals
> nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose
> even before.
> 
> This is not harmful though. The previous OOM kill attempt would kick the
> current task, mark it with TIF_MEMDIE and retry the allocation. After
> OOM is disabled the allocation simply fails. The current will die on its
> way out of the kernel. Definitely worth fixing. In a separate patch.

Hah?  Isn't this a new outright A-B B-A deadlock involving the rwsem
you added?

> > disable() call must be able to fail.
> 
> This would be a way to do it without requiring caller to check for
> TIF_MEMDIE explicitly. The fewer of them we have the better.

Why the hell would the caller ever even KNOW about this?  This is
something which must be encapsulated in the OOM killer disable/enable
interface.

> +bool oom_killer_disable(void)
>  {
> +	bool ret = true;
> +
>  	down_write(&oom_sem);

How would this task pass the above down_write() if the OOM killer is
already read locking oom_sem?  Or is the OOM killer guaranteed to make
forward progress even if the killed task can't make forward progress?
But, if so, what are we talking about in this thread?

> +
> +	/* We might have been killed while waiting for the oom_sem. */
> +	if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) {
> +		up_write(&oom_sem);
> +		ret = false;
> +	}

This is pointless.  What does the above do?

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 15:09                                       ` Tejun Heo
@ 2014-11-06 16:01                                         ` Michal Hocko
  2014-11-06 16:12                                           ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-06 16:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu 06-11-14 10:09:27, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 02:05:43PM +0100, Michal Hocko wrote:
> > But this is nothing new. Suspend hasn't been checking for fatal signals
> > nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose
> > even before.
> > 
> > This is not harmful though. The previous OOM kill attempt would kick the
> > current task, mark it with TIF_MEMDIE and retry the allocation. After
> > OOM is disabled the allocation simply fails. The current will die on its
> > way out of the kernel. Definitely worth fixing. In a separate patch.
> 
> Hah?  Isn't this a new outright A-B B-A deadlock involving the rwsem
> you added?

No, see below.
 
> > > disable() call must be able to fail.
> > 
> > This would be a way to do it without requiring caller to check for
> > TIF_MEMDIE explicitly. The fewer of them we have the better.
> 
> Why the hell would the caller ever even KNOW about this?  This is
> something which must be encapsulated in the OOM killer disable/enable
> interface.
> 
> > +bool oom_killer_disable(void)
> >  {
> > +	bool ret = true;
> > +
> >  	down_write(&oom_sem);
> 
> How would this task pass the above down_write() if the OOM killer is
> already read locking oom_sem?  Or is the OOM killer guaranteed to make
> forward progress even if the killed task can't make forward progress?
> But, if so, what are we talking about in this thread?

Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and terminates.
That will release the read lock, allow this side to take the write lock and
check whether the current task has been killed without any races.
The OOM killer doesn't wait for the killed task. The allocation is retried.
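
Put differently, the two sides look roughly like this (simplified):

	/* OOM path: a short read section which never waits for the victim */
	down_read(&oom_sem);
	/* select a victim, send SIGKILL, set TIF_MEMDIE */
	up_read(&oom_sem);	/* return; the allocation is retried */

	/* oom_killer_disable(): can therefore always make progress */
	down_write(&oom_sem);	/* blocks only for the short section above */
	/* then check whether current itself has been killed meanwhile */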

Does this explain your concern?

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 15:01                                         ` Tejun Heo
@ 2014-11-06 16:02                                           ` Michal Hocko
  2014-11-06 16:28                                             ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-06 16:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu 06-11-14 10:01:21, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 01:49:53PM +0100, Michal Hocko wrote:
> > On Wed 05-11-14 12:55:27, Tejun Heo wrote:
> > > On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote:
> > > > Because out_of_memory can be called from multiple paths. And
> > > > the only interesting one should be the page allocation path.
> > > > pagefault_out_of_memory is not interesting because it cannot happen for
> > > > the frozen task.
> > > 
> > > Hmmm.... wouldn't that be broken by definition tho?  So, if the oom
> > > killer is invoked from somewhere else than page allocation path, it
> > > would proceed ignoring the disabled setting and would race against PM
> > > freeze path all the same. 
> > 
> > Not really because try_to_freeze_tasks doesn't finish until _all_ tasks
> > are frozen and a task in the page fault path cannot be frozen, can it?
> 
> We used to have freezing points deep in file system code which may be
> reachable from page fault.

If that is really the case then there is no way around it and
out_of_memory has to be covered from the page fault path as well. I
cannot say I would be happy about that though. Ideally there should be
only a single freezing place. But that is another story.
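
Covering the page fault path would mean something along these lines
(just a sketch; the exact form may differ in the eventual patch):

	void pagefault_out_of_memory(void)
	{
		if (!out_of_memory(NULL, 0, 0, NULL, false))
			/* OOM disabled: return and keep refaulting until
			 * memory is freed or freezing fails on this task */
			return;
		/* ... */
	}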

> Please take a step back and look at the paragraph above.  Doesn't
> it sound extremely contrived and brittle even if it's not outright
> broken?  What if somebody adds another oom killing site somewhere
> else?

The only way to add an oom killing site is via out_of_memory, and with
my patch that does all the magic.

> How can this possibly be a solution that we intentionally implement?
>
> > I mean there shouldn't be any problem with not invoking the OOM killer
> > from the page fault path as well but that might lead to looping in the
> > page fault path without any progress until the freezer re-enables the OOM
> > killer on its failure path because the said task cannot be frozen.
> > 
> > Is this preferable?
> 
> Why would PM freezing make OOM killing fail?  That doesn't make much
> sense.  Sure, it can block it for a finite duration for sync purposes
> but making OOM killing fail seems the wrong way around.  

We cannot block in the allocation path because the request might come
from the freezer path itself (e.g. when suspending devices etc.).
At least this is my understanding of why the original oom disable
approach was implemented.

> We're doing one thing for non-PM freezing and the other way around for
> PM freezing, which indicates one of the two directions is wrong.

Because those two paths are quite different in their requirements. The
cgroup freezer only cares about freezing tasks and it doesn't have to
care about tasks accessing a possibly half suspended device on their way
out.

> Shouldn't it be that OOM killing happening while PM freezing is in
> progress cancels PM freezing rather than the other way around?  Find a
> point in PM suspend/hibernation operation where everything must be
> stable, disable OOM killing there and check whether OOM killing
> happened inbetween and if so back out. 

This is freeze_processes AFAIU. I might be wrong of course but this is
the point after which nobody should be waking processes up because they
could access half suspended devices.

> It seems rather obvious to me that OOM killing has to have precedence
> over PM freezing.
> 
> Sure, once the system reaches a point where the whole system must be
> in a stable state for snapshotting or whatever, disabling OOM killing
> is fine but at that point the system is in a very limited execution
> mode and sure won't be processing page faults from userland for
> example and we can actually disable OOM killing knowing that anything
> afterwards is ready to handle memory allocation failures.

I am really confused now. This is basically what the final patch does
actually.  Here is what I have currently just to make the further
discussion easier.
---
>From 337e772eaf636a96409e84bcd33d77ebc2950549 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 5 Nov 2014 15:09:56 +0100
Subject: [PATCH 1/2] OOM, PM: make OOM detection in the freezer path raceless

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window when OOM killer manages to note_oom_kill after
freeze_processes checks the counter. The race window is quite small and
really unlikely and a partial solution was deemed sufficient at the time
of submission.

Tejun wasn't happy about this partial solution though and insisted on
a full one. That requires the full OOM and freezer exclusion. This is
done by this patch which introduces the oom_sem RW lock and gets rid
of the oom_killer_disabled global flag.

The PM code uses oom_killer_{disable,enable}: disable takes the lock
for write and so excludes all OOM killer invocations until enable
releases it. out_of_memory newly returns a success status: it fails
only if the oom_sem cannot be taken for read, which indicates that
OOM has been disabled. This is done with a read trylock so we can
never deadlock.

The caller has to take an appropriate action when out_of_memory
fails.

The allocation path simply fails the allocation request the same
way as previously. The sysrq path prints a note when the OOM killer
is disabled.

The page fault path previously ignored the oom disabled flag on the
assumption that the page fault path cannot enter the fridge. As per
Tejun the freezing point used to be deep in the fs code. Therefore it
is safer and more robust to cover pagefault_out_of_memory as well.
The task will keep refaulting until either some memory is freed or the
PM freezer fails because the said task cannot be frozen and re-enables
the OOM killer, and then the OOM eventually happens if memory is still
short.

There is no need to recheck all the processes with the full
synchronization anymore.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>

fold me
---
 drivers/tty/sysrq.c    |  6 ++++--
 include/linux/oom.h    | 25 +++++++++++++----------
 kernel/power/process.c | 50 ++++++++--------------------------------------
 mm/oom_kill.c          | 54 ++++++++++++++++++++++++++++++++------------------
 mm/page_alloc.c        | 32 +++++++++++++++---------------
 5 files changed, 77 insertions(+), 90 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..14f3d7fd961f 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM killer disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e1058723..04b892ddca7d 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -68,22 +68,25 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
+/**
+ * oom_killer_disable - disable OOM killer in page allocator
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ *
+ * This function should be used with extreme care and any new usage
+ * should be discussed with MM people.
+ */
+extern void oom_killer_disable(void);
 
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+extern void oom_killer_enable(void);
 
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..7d08d56cbf3f 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,27 +132,18 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
+
+	/*
+	 * Need to exclude OOM killer from triggering while tasks are
+	 * getting frozen to make sure none of them gets killed after
+	 * try_to_freeze_tasks is done.
+	 */
+	oom_killer_disable();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		printk("done.\n");
 	}
-	printk("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..7f88ddd55f80 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
-/*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
- */
-static atomic_t oom_kills = ATOMIC_INIT(0);
-
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
-
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -615,8 +598,20 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+static DECLARE_RWSEM(oom_sem);
+
+void oom_killer_disable(void)
+{
+	down_write(&oom_sem);
+}
+
+void oom_killer_enable(void)
+{
+	up_write(&oom_sem);
+}
+
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -628,7 +623,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -693,6 +688,27 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory - tries to invoke the OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * Invokes __out_of_memory and returns true unless the OOM killer has been
+ * disabled by oom_killer_disable(), in which case it returns false.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	if (!down_read_trylock(&oom_sem))
+		return false;
+	__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+	up_read(&oom_sem);
+
+	return true;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..d44d69aa7b70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2241,10 +2239,11 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2716,8 +2707,8 @@ rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
@@ -2725,10 +2716,19 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
+					classzone_idx, migratetype,
+					&oom_failed);
+
 			if (page)
 				goto got_pg;
 
+			/*
+			 * OOM killer might be disabled and then we have to
+			 * fail the allocation
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order
-- 
2.1.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 16:01                                         ` Michal Hocko
@ 2014-11-06 16:12                                           ` Tejun Heo
  2014-11-06 16:31                                             ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-06 16:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote:
> Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and terminates.
> That will release the read lock, allow this side to take the write lock and
> check whether the current task has been killed without any races.
> The OOM killer doesn't wait for the killed task. The allocation is retried.
> 
> Does this explain your concern?

Draining oom killer then doesn't mean anything, no?  OOM killer may
have been disabled and drained but the killed tasks might wake up
after the PM freezer considers them to be frozen, right?  What am I
missing?

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 16:02                                           ` Michal Hocko
@ 2014-11-06 16:28                                             ` Tejun Heo
  2014-11-10 16:30                                               ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-06 16:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu, Nov 06, 2014 at 05:02:23PM +0100, Michal Hocko wrote:
> > Why would PM freezing make OOM killing fail?  That doesn't make much
> > sense.  Sure, it can block it for a finite duration for sync purposes
> > but making OOM killing fail seems the wrong way around.  
> 
> We cannot block in the allocation path because the request might come
> from the freezer path itself (e.g. when suspending devices etc.).
> At least this is my understanding of why the original oom disable
> approach was implemented.

I was saying that it could temporarily block either direction to
implement proper synchronization while guaranteeing forward progress.

> > We're doing one thing for non-PM freezing and the other way around for
> > PM freezing, which indicates one of the two directions is wrong.
> 
> Because those two paths are quite different in their requirements. The
> cgroup freezer only cares about freezing tasks and it doesn't have to
> care about tasks accessing a possibly half suspended device on their way
> out.

I don't think the fundamental relationship between freezing and oom
killing is different between the two and the failure to recognize
that is what's leading to these weird issues.

> > Shouldn't it be that OOM killing happening while PM freezing is in
> > progress cancels PM freezing rather than the other way around?  Find a
> > point in PM suspend/hibernation operation where everything must be
> > stable, disable OOM killing there and check whether OOM killing
> > happened inbetween and if so back out. 
> 
> This is freeze_processes AFAIU. I might be wrong of course but this is
> the point after which nobody should be waking processes up because they
> could access half suspended devices.

No, you're doing it before freezing starts.  The system is in no way
in a quiescent state at that point.

> > It seems rather obvious to me that OOM killing has to have precedence
> > over PM freezing.
> > 
> > Sure, once the system reaches a point where the whole system must be
> > in a stable state for snapshotting or whatever, disabling OOM killing
> > is fine but at that point the system is in a very limited execution
> > mode and sure won't be processing page faults from userland for
> > example and we can actually disable OOM killing knowing that anything
> > afterwards is ready to handle memory allocation failures.
> 
> I am really confused now. This is basically what the final patch does
> actually.  Here is the what I have currently just to make the further
> discussion easier.

Please see above.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 16:12                                           ` Tejun Heo
@ 2014-11-06 16:31                                             ` Michal Hocko
  2014-11-06 16:33                                               ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-06 16:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu 06-11-14 11:12:11, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote:
> > Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and terminates.
> > That will release the read lock, allow this side to take the write lock and
> > check whether the current task has been killed without any races.
> > The OOM killer doesn't wait for the killed task. The allocation is retried.
> > 
> > Does this explain your concern?
> 
> Draining oom killer then doesn't mean anything, no?  OOM killer may
> have been disabled and drained but the killed tasks might wake up
> after the PM freezer considers them to be frozen, right?  What am I
> missing?

The mutual exclusion between OOM and the freezer ensures that the
victim has TIF_MEMDIE already set before try_to_freeze_tasks even
starts. Then freezing_slow_path wouldn't allow the task to enter the
fridge so the wake up moment is not really that important.
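
For reference, the check from patch 1/4 this relies on sits in
freezing_slow_path() and looks roughly like (sketch only; the exact
form in the series may differ):

	bool freezing_slow_path(struct task_struct *p)
	{
		/* ... */
		if (test_tsk_thread_flag(p, TIF_MEMDIE))
			return false;	/* never freeze an OOM victim */
		/* ... */
	}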

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 16:31                                             ` Michal Hocko
@ 2014-11-06 16:33                                               ` Tejun Heo
  2014-11-06 16:58                                                 ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-06 16:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu, Nov 06, 2014 at 05:31:24PM +0100, Michal Hocko wrote:
> On Thu 06-11-14 11:12:11, Tejun Heo wrote:
> > On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote:
> > > Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and terminates.
> > > That will release the read lock, allow this side to take the write lock and
> > > check whether the current task has been killed without any races.
> > > The OOM killer doesn't wait for the killed task. The allocation is retried.
> > > 
> > > Does this explain your concern?
> > 
> > Draining oom killer then doesn't mean anything, no?  OOM killer may
> > have been disabled and drained but the killed tasks might wake up
> > after the PM freezer considers them to be frozen, right?  What am I
> > missing?
> 
> The mutual exclusion between OOM and the freezer ensures that the
> victim has TIF_MEMDIE already set before try_to_freeze_tasks even
> starts. Then freezing_slow_path wouldn't allow the task to enter the
> fridge so the wake up moment is not really that important.

What if it was already in the freezer?

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 16:33                                               ` Tejun Heo
@ 2014-11-06 16:58                                                 ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-06 16:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu 06-11-14 11:33:04, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 05:31:24PM +0100, Michal Hocko wrote:
> > On Thu 06-11-14 11:12:11, Tejun Heo wrote:
[...]
> > > Draining oom killer then doesn't mean anything, no?  OOM killer may
> > > have been disabled and drained but the killed tasks might wake up
> > > after the PM freezer considers them to be frozen, right?  What am I
> > > missing?
> > 
> > The mutual exclusion between OOM and the freezer ensures that the
> > victim has TIF_MEMDIE already set before try_to_freeze_tasks even
> > starts. Then freezing_slow_path wouldn't allow the task to enter the
> > fridge so the wake up moment is not really that important.
> 
> What if it was already in the freezer?

Good question! You are right that there is a race window until the wake
up happens. I will think about this case some more. There is simply no
control over when the task wakes up and the freezer will see it as frozen
until then. An immediate way around would be to check for TIF_MEMDIE in
try_to_freeze_tasks.
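
Roughly (an untested sketch, not a patch):

	/* in try_to_freeze_tasks(), while counting not-yet-frozen tasks */
	for_each_process_thread(g, p) {
		if (p == current || freezer_should_skip(p))
			continue;
		if (!frozen(p) || test_tsk_thread_flag(p, TIF_MEMDIE))
			todo++;	/* treat an OOM victim as not frozen */
	}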

I have to call it a day unfortunately and will be back on
Monday.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend
  2014-11-06 16:28                                             ` Tejun Heo
@ 2014-11-10 16:30                                               ` Michal Hocko
  2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
  0 siblings, 2 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-10 16:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Andrew Morton, Cong Wang, David Rientjes,
	Oleg Nesterov, LKML, linux-mm, Linux PM list

On Thu 06-11-14 11:28:45, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 05:02:23PM +0100, Michal Hocko wrote:
[...]
> > > We're doing one thing for non-PM freezing and the other way around for
> > > PM freezing, which indicates one of the two directions is wrong.
> > 
> > Because those two paths are quite different in their requirements. The
> > cgroup freezer only cares about freezing tasks and it doesn't have to
> > care about tasks accessing a possibly half suspended device on their way
> > out.
> 
> I don't think the fundamental relationship between freezing and oom
> killing is different between the two, and the failure to recognize
> that is what's leading to these weird issues.

I do not understand the above. Could you be more specific, please?
AFAIU the cgroup freezer requires that no task leaks into userspace
while the cgroup is frozen. This is naturally true for the OOM path
whether the two are synchronized or not.
The PM freezer, on the other hand, requires that no task is _woken up_
after all tasks are frozen. This requires synchronization between the
freezer and the OOM path because allocations are still allowed after
tasks are frozen.
What am I missing?

> > > Shouldn't it be that OOM killing happening while PM freezing is in
> > > progress cancels PM freezing rather than the other way around?  Find a
> > > point in PM suspend/hibernation operation where everything must be
> > > stable, disable OOM killing there and check whether OOM killing
> > > happened in between and if so back out. 
> > 
> > This is freeze_processes AFAIU. I might be wrong of course but this is
> > the point after which nobody should be waking processes up because they
> > could access half-suspended devices.
> 
> No, you're doing it before freezing starts.  The system is in no way
> in a quiescent state at that point.

You are right! Userspace shouldn't see any unexpected allocation
failures just because PM freezing is in progress. This whole process
should be transparent from the userspace POV.

I am getting back to
	oom_killer_lock();
	error = try_to_freeze_tasks();
	if (!error)
		oom_killer_disable();
	oom_killer_unlock();

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC 0/4] OOM vs PM freezer fixes
  2014-11-10 16:30                                               ` Michal Hocko
@ 2014-11-12 18:58                                                 ` Michal Hocko
  2014-11-12 18:58                                                   ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko
                                                                     ` (4 more replies)
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
  1 sibling, 5 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

Hi,
here is another take at OOM vs. PM freezer interaction fixes/cleanups.
The first three patches fix unlikely cases where the OOM killer races
with the PM freezer; those races should now be closed completely. The
last patch is a simple code enhancement which is not strictly needed but
is nice to have IMO.

Both the OOM killer and the PM freezer are quite subtle so I hope I
haven't missed anything. Any feedback is highly appreciated. I am also
interested in feedback on the approach used. To be honest I am not
really happy about spreading TIF_MEMDIE checks into the freezer (patch 1)
but I didn't find any other way of detecting OOM killed tasks.

Changes are based on top of Linus tree (3.18-rc3).

Michal Hocko (4):
      OOM, PM: Do not miss OOM killed frozen tasks
      OOM, PM: make OOM detection in the freezer path raceless
      OOM, PM: handle pm freezer as an OOM victim correctly
      OOM: thaw the OOM victim if it is frozen

Diffstat says:
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 39 ++++++++++++++++------
 kernel/freezer.c       | 15 +++++++--
 kernel/power/process.c | 60 +++++++++-------------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 89 ++++++++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c        | 32 +++++++++---------
 7 files changed, 147 insertions(+), 98 deletions(-)


^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks
  2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
@ 2014-11-12 18:58                                                   ` Michal Hocko
  2014-11-14 17:55                                                     ` Tejun Heo
  2014-11-12 18:58                                                   ` [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
                                                                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

Although the freezer code ignores tasks which are killed by the OOM
killer (in freezing_slow_path) there are two reasons why this is not
suitable for the PM freezer:
	- The information gets lost on its way out of the freezing path
	  because it is interpreted as if the task doesn't _need_ to be
	  frozen, which is also true for other reasons
	- The killed task might be frozen (in a cgroup) already but hasn't
	  woken up yet. We do not have an easy way to wait for such a
	  task

This means that try_to_freeze_tasks will consider all tasks frozen
even though there is an OOM victim waiting for its time slice to wake
up. The OOM kill might have happened any time before the OOM exclusion
started, so the victim might leak without the PM freezer noticing and
access already suspended devices.
Fix this by checking TIF_MEMDIE for each task in freeze_task and
considering such a task as blocking the freezer.

Also change the return value semantics as the current ones are a little
bit awkward. There is just one caller (try_to_freeze_tasks) which checks
the return value and it is only interested in whether the request was
successful or the task blocks the freezing progress. It is natural to
reflect success with true rather than false.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 kernel/freezer.c       | 15 ++++++++++++---
 kernel/power/process.c |  5 ++---
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/freezer.c b/kernel/freezer.c
index a8900a3bc27a..93bd3fc65371 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -113,7 +113,8 @@ static void fake_signal_wake_up(struct task_struct *p)
  * thread).
  *
  * RETURNS:
- * %false, if @p is not freezing or already frozen; %true, otherwise
+ * %false, if @p still blocks the freezer; %true, if @p is already frozen,
+ * doesn't need to be frozen or is ignored by the freezer altogether.
  */
 bool freeze_task(struct task_struct *p)
 {
@@ -129,12 +130,20 @@ bool freeze_task(struct task_struct *p)
 	 * normally.
 	 */
 	if (freezer_should_skip(p))
+		return true;
+
+	/*
+	 * Do not check freezing state or attempt to freeze a task
+	 * which has been killed by OOM killer. We are just waiting
+	 * for the task to wake up and die.
+	 */
+	if (test_tsk_thread_flag(p, TIF_MEMDIE))
 		return false;
 
 	spin_lock_irqsave(&freezer_lock, flags);
 	if (!freezing(p) || frozen(p)) {
 		spin_unlock_irqrestore(&freezer_lock, flags);
-		return false;
+		return true;
 	}
 
 	if (!(p->flags & PF_KTHREAD))
@@ -143,7 +152,7 @@ bool freeze_task(struct task_struct *p)
 		wake_up_state(p, TASK_INTERRUPTIBLE);
 
 	spin_unlock_irqrestore(&freezer_lock, flags);
-	return true;
+	return false;
 }
 
 void __thaw_task(struct task_struct *p)
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..3d528f291da8 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -47,11 +47,10 @@ static int try_to_freeze_tasks(bool user_only)
 		todo = 0;
 		read_lock(&tasklist_lock);
 		for_each_process_thread(g, p) {
-			if (p == current || !freeze_task(p))
+			if (p != current && freeze_task(p))
 				continue;
 
-			if (!freezer_should_skip(p))
-				todo++;
+			todo++;
 		}
 		read_unlock(&tasklist_lock);
 
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless
  2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
  2014-11-12 18:58                                                   ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko
@ 2014-11-12 18:58                                                   ` Michal Hocko
  2014-11-12 18:58                                                   ` [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly Michal Hocko
                                                                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window where the OOM killer manages to note_oom_kill
after freeze_processes has checked the counter. The race window is quite
small and really unlikely, and a partial solution was deemed sufficient
at the time of submission.

Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires full mutual exclusion between the OOM
killer and the freezer's task freezing. This is done by this patch which
introduces the oom_sem RW lock.

oom_killer_disabled is now handled at the out_of_memory level which
takes the lock for reading. This also means that the page fault path is
covered now as well, although it was assumed to be safe before. As per
Tejun, "We used to have freezing points deep in file system code which
may be reachable from page fault." so it would be better and more
robust to not rely on freezing points here. The same applies to the
memcg OOM killer.

out_of_memory tells the caller whether the OOM killer was allowed to
trigger and the callers are supposed to handle the situation. The page
allocation path simply fails the allocation, same as before. The page
fault path will keep retrying the fault until the freezer fails, and
Sysrq will simply complain to the log.

The freezer will use the new oom_killer_{un}lock API, which takes
the lock for write to wait for an ongoing OOM killer and to block all
future invocations while attempting to freeze all the tasks. If that
succeeds, oom_killer_disable is called to disallow any further OOM
killer invocations.

With the full synchronization there is no need to recheck all the
processes anymore, so that code can go away again.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 36 ++++++++++++++++-------
 kernel/power/process.c | 52 +++++++---------------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 77 ++++++++++++++++++++++++++++++++++++--------------
 mm/page_alloc.c        | 32 ++++++++++-----------
 6 files changed, 115 insertions(+), 92 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..6818589c1004 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e1058723..8ca73c0b07df 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -68,22 +68,38 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disabled;
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Has to be called with oom_killer_lock held to prevent races
+ * with an ongoing OOM killer.
+ *
+ * This function should be used with extreme care and any new usage
+ * should be discussed with MM people.
+ */
+extern void oom_killer_disable(void);
 
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+extern void oom_killer_enable(void);
 
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+/** oom_killer_lock - locks the global OOM killer.
+ *
+ * This function should be used with extreme care. No allocations
+ * are allowed with the lock held.
+ */
+extern void oom_killer_lock(void);
+
+/** oom_killer_unlock - unlocks global OOM killer.
+ */
+extern void oom_killer_unlock(void);
 
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 3d528f291da8..5c5da0fe54dd 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -107,30 +107,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -141,12 +117,18 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
 		return error;
 
+	/*
+	 * Need to exclude the OOM killer from triggering while tasks are
+	 * getting frozen to make sure none of them gets killed after
+	 * try_to_freeze_tasks is done.
+	 */
+	oom_killer_lock();
+
 	/* Make sure this task doesn't get frozen */
 	current->flags |= PF_SUSPEND_TASK;
 
@@ -156,27 +138,13 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
-		__usermodehelper_set_disable_depth(UMH_DISABLED);
 		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		__usermodehelper_set_disable_depth(UMH_DISABLED);
+		printk("done.\n");
 	}
-	printk("\n");
+	oom_killer_unlock();
 	BUG_ON(in_atomic());
 
 	if (error)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6ac0e33e150..620aff77da4a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	current->memcg_oom.order = order;
 }
 
+extern bool oom_killer_disabled;
+
 /**
  * mem_cgroup_oom_synchronize - complete memcg OOM handling
  * @handle: actually kill/wait or just clean up the OOM state
@@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
-	if (!handle)
+	if (!handle || oom_killer_disabled)
 		goto cleanup;
 
 	owait.memcg = memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..0a061803be09 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
-/*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
- */
-static atomic_t oom_kills = ATOMIC_INIT(0);
-
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
-
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -615,8 +598,31 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+bool oom_killer_disabled __read_mostly;
+static DECLARE_RWSEM(oom_sem);
+
+void oom_killer_lock(void)
+{
+	down_write(&oom_sem);
+}
+
+void oom_killer_unlock(void)
+{
+	up_write(&oom_sem);
+}
+
+void oom_killer_disable(void)
+{
+	oom_killer_disabled = true;
+}
+
+void oom_killer_enable(void)
+{
+	oom_killer_disabled = false;
+}
+
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -628,7 +634,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -693,6 +699,31 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory - tries to invoke the OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * Invokes __out_of_memory if the OOM killer is not disabled and returns true.
+ * Returns false when the OOM killer has been disabled by oom_killer_disable().
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	bool ret = false;
+
+	down_read(&oom_sem);
+	if (!oom_killer_disabled) {
+		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+		ret = true;
+	}
+	up_read(&oom_sem);
+
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -702,12 +733,16 @@ void pagefault_out_of_memory(void)
 {
 	struct zonelist *zonelist;
 
+	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		return;
+		goto unlock;
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		if (!oom_killer_disabled)
+			__out_of_memory(NULL, 0, 0, NULL, false);
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
+unlock:
+	up_read(&oom_sem);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..d44d69aa7b70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2241,10 +2239,11 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2716,8 +2707,8 @@ rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
@@ -2725,10 +2716,19 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
+					classzone_idx, migratetype,
+					&oom_failed);
+
 			if (page)
 				goto got_pg;
 
+			/*
+			 * OOM killer might be disabled and then we have to
+			 * fail the allocation
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly
  2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
  2014-11-12 18:58                                                   ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko
  2014-11-12 18:58                                                   ` [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
@ 2014-11-12 18:58                                                   ` Michal Hocko
  2014-11-12 18:58                                                   ` [RFC 4/4] OOM: thaw the OOM victim if it is frozen Michal Hocko
  2014-11-14 20:14                                                   ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo
  4 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

The PM freezer doesn't check whether the freezing task has been killed
by the OOM killer after it disables the OOM killer, which means that it
continues with the suspend even though it should die as soon as
possible. This has been the case ever since PM suspend started disabling
the OOM killer and I suppose OOM kills were ignored even before that.

This is not harmful though. The allocation which triggered the OOM will
be retried after a process is killed and the next attempt will fail
because the OOM killer will be disabled by then, so there is no risk of
an endless loop even though the OOM victim doesn't die.

But this is a correctness issue because no task should ignore an OOM
kill. As suggested by Tejun, oom_killer_lock now returns a success
status. If the current task has fatal signals pending or TIF_MEMDIE set
after oom_sem is taken, then the caller should bail out, and this is
what freeze_processes does with this patch.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    |  5 ++++-
 kernel/power/process.c |  5 ++++-
 mm/oom_kill.c          | 12 +++++++++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8ca73c0b07df..8f4f634cc5b3 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -92,10 +92,13 @@ extern void oom_killer_enable(void);
 
 /** oom_killer_lock - locks the global OOM killer.
  *
+ * Returns true on success and false if the OOM killer couldn't be
+ * locked (e.g. because the current task has already been killed).
+ *
  * This function should be used with extreme care. No allocations
  * are allowed with the lock held.
  */
-extern void oom_killer_lock(void);
+extern bool oom_killer_lock(void);
 
 /** oom_killer_unlock - unlocks global OOM killer.
  */
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5c5da0fe54dd..49d8d84ccd6e 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -127,7 +127,10 @@ int freeze_processes(void)
 	 * getting frozen to make sure none of them gets killed after
 	 * try_to_freeze_tasks is done.
 	 */
-	oom_killer_lock();
+	if (!oom_killer_lock()) {
+		usermodehelper_enable();
+		return -EBUSY;
+	}
 
 	/* Make sure this task doesn't get frozen */
 	current->flags |= PF_SUSPEND_TASK;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0a061803be09..39a591092ca0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -601,9 +601,19 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 bool oom_killer_disabled __read_mostly;
 static DECLARE_RWSEM(oom_sem);
 
-void oom_killer_lock(void)
+bool oom_killer_lock(void)
 {
+	bool ret = true;
+
 	down_write(&oom_sem);
+
+	/* We might have been killed while waiting for the oom_sem. */
+	if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		ret = false;
+	}
+
+	return ret;
 }
 
 void oom_killer_unlock(void)
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC 4/4] OOM: thaw the OOM victim if it is frozen
  2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
                                                                     ` (2 preceding siblings ...)
  2014-11-12 18:58                                                   ` [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly Michal Hocko
@ 2014-11-12 18:58                                                   ` Michal Hocko
  2014-11-14 20:14                                                   ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo
  4 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-12 18:58 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

oom_kill_process only sets the TIF_MEMDIE flag and sends a signal to the
victim. This is basically a noop when the task is frozen though, because
the task sleeps in an uninterruptible sleep. The victim is eventually
thawed later when oom_scan_process_thread meets the task again in a
later OOM invocation, so the OOM killer doesn't live lock. But this is
less than optimal. Let's add the frozen check and thaw the task right
before we send SIGKILL to the victim.

The check and thawing in oom_scan_process_thread has to stay because the
task might get access to memory reserves even without an explicit
SIGKILL from oom_kill_process (e.g. it already has a fatal signal
pending or it is already exiting).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/oom_kill.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 39a591092ca0..67ea7fb70fa4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -511,6 +511,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	rcu_read_unlock();
 
 	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	if (frozen(victim))
+		__thaw_task(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks
  2014-11-12 18:58                                                   ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko
@ 2014-11-14 17:55                                                     ` Tejun Heo
  0 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2014-11-14 17:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

Hello, Michal.

On Wed, Nov 12, 2014 at 07:58:49PM +0100, Michal Hocko wrote:
> Also change the return value semantics as the current ones are a little
> bit awkward. There is just one caller (try_to_freeze_tasks) which checks
> the return value and it is only interested in whether the request was
> successful or the task blocks the freezing progress. It is natural to
> reflect success with true rather than false.

I don't know about this.  It's also customary to return %true when
further action needs to be taken.  I don't think either is
particularly wrong but the flip seems gratuitous.

>  bool freeze_task(struct task_struct *p)
>  {
> @@ -129,12 +130,20 @@ bool freeze_task(struct task_struct *p)
>  	 * normally.
>  	 */
>  	if (freezer_should_skip(p))
> +		return true;
> +
> +	/*
> +	 * Do not check freezing state or attempt to freeze a task
> +	 * which has been killed by OOM killer. We are just waiting
> +	 * for the task to wake up and die.

Maybe saying sth like "consider the task freezing as ...." is a
clearer way to put it?

> +	 */
> +	if (test_tsk_thread_flag(p, TIF_MEMDIE))
>  		return false;

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC 0/4] OOM vs PM freezer fixes
  2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
                                                                     ` (3 preceding siblings ...)
  2014-11-12 18:58                                                   ` [RFC 4/4] OOM: thaw the OOM victim if it is frozen Michal Hocko
@ 2014-11-14 20:14                                                   ` Tejun Heo
  2014-11-18 21:08                                                     ` Michal Hocko
  4 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-11-14 20:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

On Wed, Nov 12, 2014 at 07:58:48PM +0100, Michal Hocko wrote:
> Hi,
> here is another take at OOM vs. PM freezer interaction fixes/cleanups.
> The first three patches fix unlikely cases where the OOM killer races
> with the PM freezer; those races should now be closed completely. The
> last patch is a simple code enhancement which is not strictly needed but
> is nice to have IMO.
> 
> Both the OOM killer and the PM freezer are quite subtle so I hope I
> haven't missed anything. Any feedback is highly appreciated. I am also
> interested in feedback on the approach used. To be honest I am not
> really happy about spreading TIF_MEMDIE checks into the freezer (patch 1)
> but I didn't find any other way of detecting OOM killed tasks.

I really don't get why this is structured this way.  Can't you just do
the following?

1. Freeze all freezables.  Don't worry about TIF_MEMDIE.

2. Disable OOM killer.  This should be contained in the OOM killer
   proper.  Lock out the OOM killer and disable it.

3. At this point, we know that no one will create more freezable
   threads and no new process will be OOM killed.  Wait till there's
   no process w/ TIF_MEMDIE set.

There's no reason to lock out or disable the OOM killer while the system
is not in a quiescent state, which is a big can of worms.  Bring
down the system to a quiescent state, disable the OOM killer and
then drain the TIF_MEMDIE victims.
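
For illustration, a rough sketch of that ordering (untested;
wait_for_oom_victims() is a hypothetical helper here, and
oom_killer_disable() is assumed to synchronize with any OOM invocation
already in flight):

	int freeze_processes(void)
	{
		int error;

		/* 1. Freeze everything; an OOM victim may still be live. */
		error = try_to_freeze_tasks(true);
		if (error)
			return error;

		/*
		 * 2. Disable the OOM killer.  No new TIF_MEMDIE task can
		 * appear from here on.
		 */
		oom_killer_disable();

		/*
		 * 3. Quiescent now: no new freezable threads, no new OOM
		 * kills.  Drain the victims which were already chosen.
		 * wait_for_oom_victims() is a hypothetical helper which
		 * blocks until no task has TIF_MEMDIE set.
		 */
		wait_for_oom_victims();

		return 0;
	}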

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC 0/4] OOM vs PM freezer fixes
  2014-11-14 20:14                                                   ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo
@ 2014-11-18 21:08                                                     ` Michal Hocko
  2014-11-18 21:10                                                       ` [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-18 21:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

On Fri 14-11-14 15:14:19, Tejun Heo wrote:
> On Wed, Nov 12, 2014 at 07:58:48PM +0100, Michal Hocko wrote:
> > Hi,
> > here is another take at OOM vs. PM freezer interaction fixes/cleanups.
> > The first three patches fix unlikely cases where the OOM killer races
> > with the PM freezer; those races should now be closed completely. The
> > last patch is a simple code enhancement which is not strictly needed but
> > is nice to have IMO.
> > 
> > Both the OOM killer and the PM freezer are quite subtle so I hope I
> > haven't missed anything. Any feedback is highly appreciated. I am also
> > interested in feedback on the approach used. To be honest I am not
> > really happy about spreading TIF_MEMDIE checks into the freezer (patch 1)
> > but I didn't find any other way of detecting OOM killed tasks.
> 
> I really don't get why this is structured this way.  Can't you just do
> the following?

Well, I liked how simple this was and how localized it was at the only
place which matters. When I was thinking about a solution like the one
you are describing below, it was more complicated and more subtle (e.g.
waiting for an OOM victim might be tricky if it stumbles over a lock
which is held by a frozen thread which uses try_to_freeze_unsafe).
Anyway, I gave it another try and will post the two patches as a reply
to this email. I hope both the interface and the implementation are
cleaner.

> 1. Freeze all freezables.  Don't worry about TIF_MEMDIE.
> 
> 2. Disable OOM killer.  This should be contained in the OOM killer
>    proper.  Lock out the OOM killer and disable it.
> 
> 3. At this point, we know that no one will create more freezable
>    threads and no new process will be OOM killed.  Wait till there's
>    no process w/ TIF_MEMDIE set.
> 
> There's no reason to lock out or disable the OOM killer while the system
> is not in a quiescent state, which is a big can of worms.  Bring
> down the system to a quiescent state, disable the OOM killer and
> then drain the TIF_MEMDIE victims.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE
  2014-11-18 21:08                                                     ` Michal Hocko
@ 2014-11-18 21:10                                                       ` Michal Hocko
  2014-11-18 21:10                                                         ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-11-18 21:10 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

This patch is just preparatory and doesn't introduce any functional
change.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h |  4 ++++
 kernel/exit.c       |  2 +-
 mm/memcontrol.c     |  2 +-
 mm/oom_kill.c       | 16 +++++++++++++---
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e1058723..8f7e74f8ab3a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -47,6 +47,10 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
 }
 
+void mark_tsk_oom_victim(struct task_struct *tsk);
+
+void unmark_tsk_oom_victim(struct task_struct *tsk);
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019ff953..323882973b4b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -459,7 +459,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	clear_thread_flag(TIF_MEMDIE);
+	unmark_tsk_oom_victim(current);
 }
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6ac0e33e150..302e0fc6d121 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1735,7 +1735,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..8b6e14136f4f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -421,6 +421,16 @@ void note_oom_kill(void)
 	atomic_inc(&oom_kills);
 }
 
+void mark_tsk_oom_victim(struct task_struct *tsk)
+{
+	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+}
+
+void unmark_tsk_oom_victim(struct task_struct *tsk)
+{
+	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -444,7 +454,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
+		mark_tsk_oom_victim(p);
 		put_task_struct(p);
 		return;
 	}
@@ -527,7 +537,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	mark_tsk_oom_victim(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
@@ -650,7 +660,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless
  2014-11-18 21:10                                                       ` [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE Michal Hocko
@ 2014-11-18 21:10                                                         ` Michal Hocko
  2014-11-27  0:47                                                           ` Rafael J. Wysocki
  2014-12-02 22:08                                                           ` Tejun Heo
  0 siblings, 2 replies; 93+ messages in thread
From: Michal Hocko @ 2014-11-18 21:10 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-pm, Tejun Heo, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window where the OOM killer manages to note_oom_kill
after freeze_processes has checked the counter. The race window is quite
small and really unlikely, and a partial solution was deemed sufficient
at the time of submission.

Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires full mutual exclusion between the OOM
killer and the freezer's task freezing. This is done by this patch which
introduces the oom_sem RW lock and turns oom_killer_disable() into a
full OOM barrier.

oom_killer_disabled is now checked at the out_of_memory level which
takes the lock for reading. This also means that the page fault path is
covered now as well, although it was assumed to be safe before. As per
Tejun, "We used to have freezing points deep in file system code which
may be reachable from page fault." so it would be better and more
robust to not rely on freezing points here. The same applies to the
memcg OOM killer.

out_of_memory tells the caller whether the OOM killer was allowed to
trigger and the callers are supposed to handle the situation. The page
allocation path simply fails the allocation, same as before. The page
fault path will keep retrying the fault until the freezer fails, and the
Sysrq OOM trigger will simply complain to the log.

oom_killer_disable takes oom_sem for writing and, after it disables
further OOM killer invocations, it checks for any OOM victims which
are still alive (because they haven't woken up to handle the pending
signal). Victims are counted via {un}mark_tsk_oom_victim. The
last victim signals the completion via oom_victims_wait, on which
oom_killer_disable() waits if it sees a non-zero oom_victims count.
This is safe because mark_tsk_oom_victim cannot be called after
oom_killer_disabled is set, unmark_tsk_oom_victim signals the
completion only for the last OOM victim while OOM is disabled, and
oom_killer_disable waits for the completion only if there was at least
one victim at the time it disabled the OOM killer.

As oom_killer_disable is a full OOM barrier now, we can postpone it
until after all freezable tasks are frozen during the PM freeze. This
reduces the time when the OOM killer is put out of order and so reduces
the chances of misbehavior due to unexpected allocation failures.

TODO:
Android lowmemory killer abuses mark_tsk_oom_victim in lowmem_scan
and it has to learn about oom_disable logic as well.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 26 ++++++++------
 kernel/power/process.c | 60 +++++++++-----------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 94 +++++++++++++++++++++++++++++++++++++++++---------
 mm/page_alloc.c        | 32 ++++++++---------
 6 files changed, 132 insertions(+), 90 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..6818589c1004 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8f7e74f8ab3a..d802575c9307 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,22 +72,26 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are dead.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
+extern bool oom_killer_disable(void);
 
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+extern void oom_killer_enable(void);
 
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..a4306e39f35c 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,27 +132,11 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		printk("done.\n");
 	}
-	printk("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
@@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
 	printk("\n");
 	BUG_ON(in_atomic());
 
+	/*
+	 * Now that everything freezable is handled we need to disable
+	 * the OOM killer to disallow any further interference with
+	 * killable tasks.
+	 */
+	printk("Disabling OOM killer ... ");
+	if (!oom_killer_disable()) {
+		printk("failed.\n");
+		error = -EAGAIN;
+	} else
+		printk("done.\n");
+
 	if (error)
 		thaw_kernel_threads();
 	return error;
@@ -222,8 +193,6 @@ void thaw_processes(void)
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
-	oom_killer_enable();
-
 	printk("Restarting tasks ... ");
 
 	__usermodehelper_set_disable_depth(UMH_FREEZING);
@@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
 {
 	struct task_struct *g, *p;
 
+	printk("Enabling OOM killer again.\n");
+	oom_killer_enable();
+
 	pm_nosig_freezing = false;
 	printk("Restarting kernel threads ... ");
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 302e0fc6d121..34bcbb053132 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	current->memcg_oom.order = order;
 }
 
+extern bool oom_killer_disabled;
+
 /**
  * mem_cgroup_oom_synchronize - complete memcg OOM handling
  * @handle: actually kill/wait or just clean up the OOM state
@@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
-	if (!handle)
+	if (!handle || oom_killer_disabled)
 		goto cleanup;
 
 	owait.memcg = memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8b6e14136f4f..b3ccd92bc6dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,30 +405,63 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 /*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
+ * Number of OOM victims in flight
  */
-static atomic_t oom_kills = ATOMIC_INIT(0);
+static atomic_t oom_victims = ATOMIC_INIT(0);
+static DECLARE_COMPLETION(oom_victims_wait);
 
-int oom_kills_count(void)
+bool oom_killer_disabled __read_mostly;
+static DECLARE_RWSEM(oom_sem);
+
+void mark_tsk_oom_victim(struct task_struct *tsk)
 {
-	return atomic_read(&oom_kills);
+	BUG_ON(oom_killer_disabled);
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+	atomic_inc(&oom_victims);
 }
 
-void note_oom_kill(void)
+void unmark_tsk_oom_victim(struct task_struct *tsk)
 {
-	atomic_inc(&oom_kills);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+
+	down_read(&oom_sem);
+	/*
+	 * There is no need to signal the last oom_victim if there
+	 * is nobody who cares.
+	 */
+	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+		complete(&oom_victims_wait);
+	up_read(&oom_sem);
 }
 
-void mark_tsk_oom_victim(struct task_struct *tsk)
+bool oom_killer_disable(void)
 {
-	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+	int count;
+
+	/*
+	 * Make sure to not race with an ongoing OOM killer
+	 * and that current is not the victim.
+	 */
+	down_write(&oom_sem);
+	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
+		oom_killer_disabled = true;
+
+	count = atomic_read(&oom_victims);
+	up_write(&oom_sem);
+
+	if (count && oom_killer_disabled)
+		wait_for_completion(&oom_victims_wait);
+
+	return oom_killer_disabled;
 }
 
-void unmark_tsk_oom_victim(struct task_struct *tsk)
+void oom_killer_enable(void)
 {
-	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
+	down_write(&oom_sem);
+	oom_killer_disabled = false;
+	up_write(&oom_sem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -626,7 +659,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -638,7 +671,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -703,6 +736,31 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory - tries to invoke the OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * Invokes __out_of_memory if the OOM killer is not disabled and returns true.
+ * Returns false when the OOM killer has been disabled by oom_killer_disable().
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	bool ret = false;
+
+	down_read(&oom_sem);
+	if (!oom_killer_disabled) {
+		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+		ret = true;
+	}
+	up_read(&oom_sem);
+
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
 {
 	struct zonelist *zonelist;
 
+	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		return;
+		goto unlock;
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		if (!oom_killer_disabled)
+			__out_of_memory(NULL, 0, 0, NULL, false);
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
+unlock:
+	up_read(&oom_sem);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..d44d69aa7b70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2241,10 +2239,11 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2716,8 +2707,8 @@ rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
@@ -2725,10 +2716,19 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
+					classzone_idx, migratetype,
+					&oom_failed);
+
 			if (page)
 				goto got_pg;
 
+			/*
+			 * OOM killer might be disabled and then we have to
+			 * fail the allocation
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless
  2014-11-18 21:10                                                         ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
@ 2014-11-27  0:47                                                           ` Rafael J. Wysocki
  2014-12-02 22:08                                                           ` Tejun Heo
  1 sibling, 0 replies; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-11-27  0:47 UTC (permalink / raw)
  To: Michal Hocko, Tejun Heo
  Cc: LKML, linux-mm, linux-pm, Andrew Morton, David Rientjes,
	Oleg Nesterov, Cong Wang

On Tuesday, November 18, 2014 10:10:06 PM Michal Hocko wrote:
> 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
> has left a race window when OOM killer manages to note_oom_kill after
> freeze_processes checks the counter. The race window is quite small and
> really unlikely and partial solution deemed sufficient at the time of
> submission.
> 
> Tejun wasn't happy about this partial solution though and insisted on a
> full solution. That requires the full OOM and freezer's task freezing
> exclusion, though. This is done by this patch which introduces oom_sem
> RW lock and turns oom_killer_disable() into a full OOM barrier.
> 
> oom_killer_disabled is now checked at out_of_memory level which takes
> the lock for reading. This also means that the page fault path is
> covered now as well although it was assumed to be safe before. As per
> Tejun, "We used to have freezing points deep in file system code which
> may be reacheable from page fault." so it would be better and more
> robust to not rely on freezing points here. Same applies to the memcg
> OOM killer.
> 
> out_of_memory tells the caller whether the OOM killer was allowed to
> trigger and the callers are supposed to handle the situation. The page
> allocation path simply fails the allocation, the same as before. The
> page fault path will retry the fault until the freezer fails, and the
> sysrq OOM trigger will simply complain to the log.
> 
> oom_killer_disable takes oom_sem for writing and, after it disables
> further OOM killer invocations, it checks for any OOM victims which
> are still alive (because they haven't woken up to handle the pending
> signal). Victims are counted via {un}mark_tsk_oom_victim. The
> last victim signals the completion via oom_victims_wait, on which
> oom_killer_disable() waits if it sees a non-zero oom_victims.
> This is safe because mark_tsk_oom_victim cannot be called after
> oom_killer_disabled is set, unmark_tsk_oom_victim signals the
> completion only for the last OOM victim while OOM is disabled, and
> oom_killer_disable waits for the completion only if there was at least
> one victim at the time it disabled the OOM killer.
> 
> As oom_killer_disable is a full OOM barrier now, we can postpone it
> until after all freezable tasks are frozen during the PM freeze. This
> reduces the time when the OOM killer is put out of order and so
> reduces the chances of misbehavior due to unexpected allocation
> failures.
> 
> TODO:
> Android lowmemory killer abuses mark_tsk_oom_victim in lowmem_scan
> and it has to learn about oom_disable logic as well.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

This appears to do the right thing to me, although I admit I haven't checked
the details very carefully.

Tejun?

> ---
>  drivers/tty/sysrq.c    |  6 ++--
>  include/linux/oom.h    | 26 ++++++++------
>  kernel/power/process.c | 60 +++++++++-----------------------
>  mm/memcontrol.c        |  4 ++-
>  mm/oom_kill.c          | 94 +++++++++++++++++++++++++++++++++++++++++---------
>  mm/page_alloc.c        | 32 ++++++++---------
>  6 files changed, 132 insertions(+), 90 deletions(-)
> 
> diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
> index 42bad18c66c9..6818589c1004 100644
> --- a/drivers/tty/sysrq.c
> +++ b/drivers/tty/sysrq.c
> @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> -		      0, NULL, true);
> +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> +			   GFP_KERNEL, 0, NULL, true)) {
> +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> +	}
>  }
>  
>  static DECLARE_WORK(moom_work, moom_callback);
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 8f7e74f8ab3a..d802575c9307 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -72,22 +72,26 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  		unsigned long totalpages, const nodemask_t *nodemask,
>  		bool force_kill);
>  
> -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *mask, bool force_kill);
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
>  
> -extern bool oom_killer_disabled;
> -
> -static inline void oom_killer_disable(void)
> -{
> -	oom_killer_disabled = true;
> -}
> +/**
> + * oom_killer_disable - disable OOM killer
> + *
> + * Forces all page allocations to fail rather than trigger OOM killer.
> + * Will block and wait until all OOM victims are dead.
> + *
> + * Returns true if successful and false if the OOM killer cannot be
> + * disabled.
> + */
> +extern bool oom_killer_disable(void);
>  
> -static inline void oom_killer_enable(void)
> -{
> -	oom_killer_disabled = false;
> -}
> +/**
> + * oom_killer_enable - enable OOM killer
> + */
> +extern void oom_killer_enable(void);
>  
>  static inline bool oom_gfp_allowed(gfp_t gfp_mask)
>  {
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 5a6ec8678b9a..a4306e39f35c 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
>  	return todo ? -EBUSY : 0;
>  }
>  
> -static bool __check_frozen_processes(void)
> -{
> -	struct task_struct *g, *p;
> -
> -	for_each_process_thread(g, p)
> -		if (p != current && !freezer_should_skip(p) && !frozen(p))
> -			return false;
> -
> -	return true;
> -}
> -
> -/*
> - * Returns true if all freezable tasks (except for current) are frozen already
> - */
> -static bool check_frozen_processes(void)
> -{
> -	bool ret;
> -
> -	read_lock(&tasklist_lock);
> -	ret = __check_frozen_processes();
> -	read_unlock(&tasklist_lock);
> -	return ret;
> -}
> -
>  /**
>   * freeze_processes - Signal user space processes to enter the refrigerator.
>   * The current thread will not be frozen.  The same process that calls
> @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
>  int freeze_processes(void)
>  {
>  	int error;
> -	int oom_kills_saved;
>  
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -157,27 +132,11 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> -		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");
>  	}
> -	printk("\n");
>  	BUG_ON(in_atomic());
>  
>  	if (error)
> @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> +	/*
> +	 * Now that everything freezable is handled we need to disable
> +	 * the OOM killer to disallow any further interference with
> +	 * killable tasks.
> +	 */
> +	printk("Disabling OOM killer ... ");
> +	if (!oom_killer_disable()) {
> +		printk("failed.\n");
> +		error = -EAGAIN;
> +	} else
> +		printk("done.\n");
> +
>  	if (error)
>  		thaw_kernel_threads();
>  	return error;
> @@ -222,8 +193,6 @@ void thaw_processes(void)
>  	pm_freezing = false;
>  	pm_nosig_freezing = false;
>  
> -	oom_killer_enable();
> -
>  	printk("Restarting tasks ... ");
>  
>  	__usermodehelper_set_disable_depth(UMH_FREEZING);
> @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
>  {
>  	struct task_struct *g, *p;
>  
> +	printk("Enabling OOM killer again.\n");
> +	oom_killer_enable();
> +
>  	pm_nosig_freezing = false;
>  	printk("Restarting kernel threads ... ");
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 302e0fc6d121..34bcbb053132 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	current->memcg_oom.order = order;
>  }
>  
> +extern bool oom_killer_disabled;
> +
>  /**
>   * mem_cgroup_oom_synchronize - complete memcg OOM handling
>   * @handle: actually kill/wait or just clean up the OOM state
> @@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!memcg)
>  		return false;
>  
> -	if (!handle)
> +	if (!handle || oom_killer_disabled)
>  		goto cleanup;
>  
>  	owait.memcg = memcg;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8b6e14136f4f..b3ccd92bc6dc 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -405,30 +405,63 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  }
>  
>  /*
> - * Number of OOM killer invocations (including memcg OOM killer).
> - * Primarily used by PM freezer to check for potential races with
> - * OOM killed frozen task.
> + * Number of OOM victims in flight
>   */
> -static atomic_t oom_kills = ATOMIC_INIT(0);
> +static atomic_t oom_victims = ATOMIC_INIT(0);
> +static DECLARE_COMPLETION(oom_victims_wait);
>  
> -int oom_kills_count(void)
> +bool oom_killer_disabled __read_mostly;
> +static DECLARE_RWSEM(oom_sem);
> +
> +void mark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	return atomic_read(&oom_kills);
> +	BUG_ON(oom_killer_disabled);
> +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;
> +	atomic_inc(&oom_victims);
>  }
>  
> -void note_oom_kill(void)
> +void unmark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	atomic_inc(&oom_kills);
> +	int count;
> +
> +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;
> +
> +	down_read(&oom_sem);
> +	/*
> +	 * There is no need to signal the last oom_victim if there
> +	 * is nobody who cares.
> +	 */
> +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +		complete(&oom_victims_wait);
> +	up_read(&oom_sem);
>  }
>  
> -void mark_tsk_oom_victim(struct task_struct *tsk)
> +bool oom_killer_disable(void)
>  {
> -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +	/*
> +	 * Make sure to not race with an ongoing OOM killer
> +	 * and that the current is not the victim.
> +	 */
> +	down_write(&oom_sem);
> +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> +		oom_killer_disabled = true;
> +
> +	count = atomic_read(&oom_victims);
> +	up_write(&oom_sem);
> +
> +	if (count && oom_killer_disabled)
> +		wait_for_completion(&oom_victims_wait);
> +
> +	return oom_killer_disabled;
>  }
>  
> -void unmark_tsk_oom_victim(struct task_struct *tsk)
> +void oom_killer_enable(void)
>  {
> -	clear_thread_flag(TIF_MEMDIE);
> +	down_write(&oom_sem);
> +	oom_killer_disabled = false;
> +	up_write(&oom_sem);
>  }
>  
>  #define K(x) ((x) << (PAGE_SHIFT-10))
> @@ -626,7 +659,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
>  
>  /**
> - * out_of_memory - kill the "best" process when we run out of memory
> + * __out_of_memory - kill the "best" process when we run out of memory
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> @@ -638,7 +671,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
>   * OR try to be smart about which process to kill. Note that we
>   * don't have to be perfect here, we just have to be good.
>   */
> -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask, bool force_kill)
>  {
>  	const nodemask_t *mpol_mask;
> @@ -703,6 +736,31 @@ out:
>  		schedule_timeout_killable(1);
>  }
>  
> +/** out_of_memory -  tries to invoke OOM killer.
> + * @zonelist: zonelist pointer
> + * @gfp_mask: memory allocation flags
> + * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> + * @force_kill: true if a task must be killed, even if others are exiting
> + *
> + * Invokes __out_of_memory unless the OOM killer has been disabled by
> + * oom_killer_disable(); returns false in that case. Otherwise returns true.
> + */
> +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +		int order, nodemask_t *nodemask, bool force_kill)
> +{
> +	bool ret = false;
> +
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled) {
> +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> +		ret = true;
> +	}
> +	up_read(&oom_sem);
> +
> +	return ret;
> +}
> +
>  /*
>   * The pagefault handler calls here because it is out of memory, so kill a
>   * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
> @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
>  {
>  	struct zonelist *zonelist;
>  
> +	down_read(&oom_sem);
>  	if (mem_cgroup_oom_synchronize(true))
> -		return;
> +		goto unlock;
>  
>  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
>  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> -		out_of_memory(NULL, 0, 0, NULL, false);
> +		if (!oom_killer_disabled)
> +			__out_of_memory(NULL, 0, 0, NULL, false);
>  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
>  	}
> +unlock:
> +	up_read(&oom_sem);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9cd36b822444..d44d69aa7b70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
>  					PB_migrate, PB_migrate_end);
>  }
>  
> -bool oom_killer_disabled __read_mostly;
> -
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
>  {
> @@ -2241,10 +2239,11 @@ static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
>  	nodemask_t *nodemask, struct zone *preferred_zone,
> -	int classzone_idx, int migratetype)
> +	int classzone_idx, int migratetype, bool *oom_failed)
>  {
>  	struct page *page;
>  
> +	*oom_failed = false;
>  	/* Acquire the per-zone oom lock for each zone */
>  	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
>  		schedule_timeout_uninterruptible(1);
> @@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/*
> -	 * PM-freezer should be notified that there might be an OOM killer on
> -	 * its way to kill and wake somebody up. This is too early and we might
> -	 * end up not killing anything but false positives are acceptable.
> -	 * See freeze_processes.
> -	 */
> -	note_oom_kill();
> -
> -	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
>  	 * we're still under heavy pressure.
> @@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  			goto out;
>  	}
>  	/* Exhausted what can be done so it's blamo time */
> -	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> -
> +	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
> +		*oom_failed = true;
>  out:
>  	oom_zonelist_unlock(zonelist, gfp_mask);
>  	return page;
> @@ -2716,8 +2707,8 @@ rebalance:
>  	 */
>  	if (!did_some_progress) {
>  		if (oom_gfp_allowed(gfp_mask)) {
> -			if (oom_killer_disabled)
> -				goto nopage;
> +			bool oom_failed;
> +
>  			/* Coredumps can quickly deplete all memory reserves */
>  			if ((current->flags & PF_DUMPCORE) &&
>  			    !(gfp_mask & __GFP_NOFAIL))
> @@ -2725,10 +2716,19 @@ rebalance:
>  			page = __alloc_pages_may_oom(gfp_mask, order,
>  					zonelist, high_zoneidx,
>  					nodemask, preferred_zone,
> -					classzone_idx, migratetype);
> +					classzone_idx, migratetype,
> +					&oom_failed);
> +
>  			if (page)
>  				goto got_pg;
>  
> +			/*
> +			 * OOM killer might be disabled and then we have to
> +			 * fail the allocation
> +			 */
> +			if (oom_failed)
> +				goto nopage;
> +
>  			if (!(gfp_mask & __GFP_NOFAIL)) {
>  				/*
>  				 * The oom killer is not called for high-order
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless
  2014-11-18 21:10                                                         ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
  2014-11-27  0:47                                                           ` Rafael J. Wysocki
@ 2014-12-02 22:08                                                           ` Tejun Heo
  2014-12-04 14:16                                                             ` Michal Hocko
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-12-02 22:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

Hello, sorry about the delay.  Was on vacation.

Generally looks good to me.  Some comments below.

> @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> -		      0, NULL, true);
> +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> +			   GFP_KERNEL, 0, NULL, true)) {
> +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> +	}
>  }

CodingStyle line 157 says "Do not unnecessarily use braces where a
single statement will do.".
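
IOW, something like this (untested) should do:

	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
			   GFP_KERNEL, 0, NULL, true))
		printk(KERN_INFO "OOM request ignored because killer is disabled\n");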

> +/**
> + * oom_killer_disable - disable OOM killer
> + *
> + * Forces all page allocations to fail rather than trigger OOM killer.
> + * Will block and wait until all OOM victims are dead.
> + *
> > + * Returns true if successful and false if the OOM killer cannot be
> + * disabled.
> + */
> +extern bool oom_killer_disable(void);

And function comments usually go where the function body is, not where
the function is declared, no?

> @@ -157,27 +132,11 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> -		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");

A delta but shouldn't it be pr_cont()?

...
> @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> +	/*
> > +	 * Now that everything freezable is handled we need to disable
> +	 * the OOM killer to disallow any further interference with
> +	 * killable tasks.
> +	 */
> +	printk("Disabling OOM killer ... ");
> +	if (!oom_killer_disable()) {
> +		printk("failed.\n");
> +		error = -EAGAIN;
> +	} else
> +		printk("done.\n");

Ditto on pr_cont() and CodingStyle line 169 says "This does not apply
if only one branch of a conditional statement is a single statement;
in the latter case use braces in both branches:"
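
IOW (untested):

	if (!oom_killer_disable()) {
		pr_cont("failed.\n");
		error = -EAGAIN;
	} else {
		pr_cont("done.\n");
	}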

> @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
>  {
>  	struct task_struct *g, *p;
>  
> +	printk("Enabling OOM killer again.\n");

Do we really need this printk?  The same goes for Disabling OOM
killer.  For freezing it makes some sense because freezing may take a
considerable amount of time and even occasionally fail due to
timeout.  We aren't really expecting those to happen for OOM victims.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 302e0fc6d121..34bcbb053132 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	current->memcg_oom.order = order;
>  }
>  
> +extern bool oom_killer_disabled;

Ugh... don't we wanna put this in a header file?

> +void mark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	return atomic_read(&oom_kills);
> +	BUG_ON(oom_killer_disabled);

WARN_ON_ONCE() is prolly a better option here?

> +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))

Can a task actually be selected as an OOM victim multiple times?

> +		return;
> +	atomic_inc(&oom_victims);
>  }
>  
> -void note_oom_kill(void)
> +void unmark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	atomic_inc(&oom_kills);
> +	int count;
> +
> +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;

Maybe test this inline in exit_mm()?  e.g.

	if (test_thread_flag(TIF_MEMDIE))
		unmark_tsk_oom_victim(current);

Also, can the function ever be called by someone other than current?
If not, why would it take @task?

> +
> +	down_read(&oom_sem);
> +	/*
> > +	 * There is no need to signal the last oom_victim if there
> +	 * is nobody who cares.
> +	 */
> +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +		complete(&oom_victims_wait);

I don't think using completion this way is safe.  Please read on.

> +	up_read(&oom_sem);
>  }
>  
> -void mark_tsk_oom_victim(struct task_struct *tsk)
> +bool oom_killer_disable(void)
>  {
> -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +	/*
> +	 * Make sure to not race with an ongoing OOM killer
> +	 * and that the current is not the victim.
> +	 */
> +	down_write(&oom_sem);
> +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> +		oom_killer_disabled = true;

Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow.

> +
> +	count = atomic_read(&oom_victims);
> +	up_write(&oom_sem);
> +
> +	if (count && oom_killer_disabled)
> +		wait_for_completion(&oom_victims_wait);

So, each complete() increments the done count and wait decs.  The
above code works iff the complete()'s and wait()'s are always balanced
which usually isn't true in this type of wait code.  Either use
reinit_completion() / complete_all() combos or wait_event().

> +
> +	return oom_killer_disabled;

Maybe 0 / -errno is better choice as return values?

> +/** out_of_memory -  tries to invoke OOM killer.

Formatting?

> + * @zonelist: zonelist pointer
> + * @gfp_mask: memory allocation flags
> + * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> + * @force_kill: true if a task must be killed, even if others are exiting
> + *
> > + * Invokes __out_of_memory unless the OOM killer has been disabled by
> > + * oom_killer_disable(); returns false in that case. Otherwise returns true.
> + */
> +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +		int order, nodemask_t *nodemask, bool force_kill)
> +{
> +	bool ret = false;
> +
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled) {
> +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> +		ret = true;
> +	}
> +	up_read(&oom_sem);
> +
> +	return ret;

Ditto on return value.  0 / -EBUSY seem like a better choice to me.

> @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
>  {
>  	struct zonelist *zonelist;
>  
> +	down_read(&oom_sem);
>  	if (mem_cgroup_oom_synchronize(true))
> -		return;
> +		goto unlock;
>  
>  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
>  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> -		out_of_memory(NULL, 0, 0, NULL, false);
> +		if (!oom_killer_disabled)
> +			__out_of_memory(NULL, 0, 0, NULL, false);
>  		oom_zonelist_unlock(zonelist, GFP_KERNEL);

Is this a condition which can happen and we can deal with?  With
userland fully frozen, there shouldn't be page faults which lead to
memory allocation, right?  Shouldn't we document how oom
disable/enable is supposed to be used (it only makes sense while the
whole system is in quiescent state) and at least trigger
WARN_ON_ONCE() if the above code path gets triggered while oom killer
is disabled?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-02 22:08                                                           ` Tejun Heo
@ 2014-12-04 14:16                                                             ` Michal Hocko
  2014-12-04 14:44                                                               ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-04 14:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

On Tue 02-12-14 17:08:04, Tejun Heo wrote:
> Hello, sorry about the delay.  Was on vacation.
> 
> Generally looks good to me.  Some comments below.
> 
> > @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
> >  
> >  static void moom_callback(struct work_struct *ignored)
> >  {
> > -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> > -		      0, NULL, true);
> > +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> > +			   GFP_KERNEL, 0, NULL, true)) {
> > +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> > +	}
> >  }
> 
> CodingStyle line 157 says "Do not unnecessarily use braces where a
> single statement will do.".

Sure. Fixed

> > +/**
> > + * oom_killer_disable - disable OOM killer
> > + *
> > + * Forces all page allocations to fail rather than trigger OOM killer.
> > + * Will block and wait until all OOM victims are dead.
> > + *
> > > + * Returns true if successful and false if the OOM killer cannot be
> > + * disabled.
> > + */
> > +extern bool oom_killer_disable(void);
> 
> And function comments usually go where the function body is, not where
> the function is declared, no?

Fixed

> > @@ -157,27 +132,11 @@ int freeze_processes(void)
> >  	pm_wakeup_clear();
> >  	printk("Freezing user space processes ... ");
> >  	pm_freezing = true;
> > -	oom_kills_saved = oom_kills_count();
> >  	error = try_to_freeze_tasks(true);
> >  	if (!error) {
> >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> > -		oom_killer_disable();
> > -
> > -		/*
> > -		 * There might have been an OOM kill while we were
> > -		 * freezing tasks and the killed task might be still
> > -		 * on the way out so we have to double check for race.
> > -		 */
> > -		if (oom_kills_count() != oom_kills_saved &&
> > -		    !check_frozen_processes()) {
> > -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> > -			printk("OOM in progress.");
> > -			error = -EBUSY;
> > -		} else {
> > -			printk("done.");
> > -		}
> > +		printk("done.\n");
> 
> A delta but shouldn't it be pr_cont()?

kernel/power/process.c doesn't use pr_* so I've stayed with what the
rest of the file is using. I can add a patch which transforms all of
them.

> ...
> > @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
> >  	printk("\n");
> >  	BUG_ON(in_atomic());
> >  
> > +	/*
> > > +	 * Now that everything freezable is handled we need to disable
> > +	 * the OOM killer to disallow any further interference with
> > +	 * killable tasks.
> > +	 */
> > +	printk("Disabling OOM killer ... ");
> > +	if (!oom_killer_disable()) {
> > +		printk("failed.\n");
> > +		error = -EAGAIN;
> > +	} else
> > +		printk("done.\n");
> 
> Ditto on pr_cont() and
>
> CodingStyle line 169 says "This does not apply
> if only one branch of a conditional statement is a single statement;
> in the latter case use braces in both branches:"

Fixed

> > @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
> >  {
> >  	struct task_struct *g, *p;
> >  
> > +	printk("Enabling OOM killer again.\n");
> 
> Do we really need this printk?  The same goes for Disabling OOM
> killer.  For freezing it makes some sense because freezing may take a
> > considerable amount of time and even occasionally fail due to
> timeout.  We aren't really expecting those to happen for OOM victims.

I just considered them useful when there are follow-up allocation
failure messages, to make it clear that those are due to the disabled
OOM killer. I can remove them.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 302e0fc6d121..34bcbb053132 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  	current->memcg_oom.order = order;
> >  }
> >  
> > +extern bool oom_killer_disabled;
> 
> Ugh... don't we wanna put this in a header file?

Who else would need the declaration? This is not something random code
should look at.

> > +void mark_tsk_oom_victim(struct task_struct *tsk)
> >  {
> > -	return atomic_read(&oom_kills);
> > +	BUG_ON(oom_killer_disabled);
> 
> WARN_ON_ONCE() is prolly a better option here?

Well, something fishy is going on when oom_killer_disabled is set and we
mark a new OOM victim. This is a clear bug. Why would we warn and
allow the follow-up breakage?

> > +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> 
> Can a task actually be selected as an OOM victim multiple times?

AFAICS nothing prevents the global OOM and memcg OOM killers from racing.
 
> > +		return;
> > +	atomic_inc(&oom_victims);
> >  }
> >  
> > -void note_oom_kill(void)
> > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> >  {
> > -	atomic_inc(&oom_kills);
> > +	int count;
> > +
> > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > +		return;
> 
> Maybe test this inline in exit_mm()?  e.g.
> 
> 	if (test_thread_flag(TIF_MEMDIE))
> 		unmark_tsk_oom_victim(current);

Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
to reduce the usage of the flag as much as possible.

> Also, can the function ever be called by someone other than current?
> If not, why would it take @task?

Changed to use current only. If there is anybody who needs that we can
change that later. I wanted to have it symmetric to mark_tsk_oom_victim
but that is not that important.

> > +
> > +	down_read(&oom_sem);
> > +	/*
> > +	 * There is no need to signal the last oom_victim if there
> > +	 * is nobody who cares.
> > +	 */
> > +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> > +		complete(&oom_victims_wait);
> 
> I don't think using completion this way is safe.  Please read on.
> 
> > +	up_read(&oom_sem);
> >  }
> >  
> > -void mark_tsk_oom_victim(struct task_struct *tsk)
> > +bool oom_killer_disable(void)
> >  {
> > -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > +	/*
> > +	 * Make sure to not race with an ongoing OOM killer
> > +	 * and that the current is not the victim.
> > +	 */
> > +	down_write(&oom_sem);
> > +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> > +		oom_killer_disabled = true;
> 
> Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow.

OK

> > +
> > +	count = atomic_read(&oom_victims);
> > +	up_write(&oom_sem);
> > +
> > +	if (count && oom_killer_disabled)
> > +		wait_for_completion(&oom_victims_wait);
> 
> So, each complete() increments the done count and wait decs.  The
> above code works iff the complete()'s and wait()'s are always balanced
> which usually isn't true in this type of wait code.  Either use
> reinit_completion() / complete_all() combos or wait_event().

Hmm, I thought that only a single instance of freeze_kernel_threads
(which calls oom_killer_disable) can run at a time. But I am currently
not sure that all paths are called under lock_system_sleep.
I am not familiar with reinit_completion API. Is the following correct?
[...]
@@ -434,10 +434,23 @@ void unmark_tsk_oom_victim(struct task_struct *tsk)
 	 * is nobody who cares.
 	 */
 	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
-		complete(&oom_victims_wait);
+		complete_all(&oom_victims_wait);
 	up_read(&oom_sem);
 }
[...]
@@ -445,16 +458,23 @@ bool oom_killer_disable(void)
 	 * and that the current is not the victim.
 	 */
 	down_write(&oom_sem);
-	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
-		oom_killer_disabled = true;
+	if (test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		return false;
+	}
+
+	/* unmark_tsk_oom_victim is calling complete_all */
+	if (!oom_killer_disabled)
+		reinit_completion(&oom_victims_wait);
 
+	oom_killer_disabled = true;
 	count = atomic_read(&oom_victims);
 	up_write(&oom_sem);
 
-	if (count && oom_killer_disabled)
+	if (count)
 		wait_for_completion(&oom_victims_wait);
 
-	return oom_killer_disabled;
+	return true;
 }

> > +
> > +	return oom_killer_disabled;
> 
> Maybe 0 / -errno is better choice as return values?

I do not have a problem changing this if you feel strongly about it but
true/false sounds easier to me and it allows the caller to decide what
to report. If there were multiple reasons to fail then sure, but that
is not the case.
 
> > +/** out_of_memory -  tries to invoke OOM killer.
> 
> Formatting?

fixed

> > + * @zonelist: zonelist pointer
> > + * @gfp_mask: memory allocation flags
> > + * @order: amount of memory being requested as a power of 2
> > + * @nodemask: nodemask passed to page allocator
> > + * @force_kill: true if a task must be killed, even if others are exiting
> > + *
> > + * Invokes __out_of_memory unless the OOM killer has been disabled by
> > + * oom_killer_disable(); returns false in that case. Otherwise returns true.
> > + */
> > +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > +		int order, nodemask_t *nodemask, bool force_kill)
> > +{
> > +	bool ret = false;
> > +
> > +	down_read(&oom_sem);
> > +	if (!oom_killer_disabled) {
> > +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> > +		ret = true;
> > +	}
> > +	up_read(&oom_sem);
> > +
> > +	return ret;
> 
> Ditto on return value.  0 / -EBUSY seem like a better choice to me.
> 
> > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> >  {
> >  	struct zonelist *zonelist;
> >  
> > +	down_read(&oom_sem);
> >  	if (mem_cgroup_oom_synchronize(true))
> > -		return;
> > +		goto unlock;
> >  
> >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > -		out_of_memory(NULL, 0, 0, NULL, false);
> > +		if (!oom_killer_disabled)
> > +			__out_of_memory(NULL, 0, 0, NULL, false);
> >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> 
> Is this a condition which can happen and we can deal with? With
> userland fully frozen, there shouldn't be page faults which lead to
> memory allocation, right?

Except for racing OOM victims which were missed by try_to_freeze_tasks
because they didn't get a CPU slice to wake up from the freezer. The task
would die on the way out from the page fault exception. I have updated
the changelog to be more verbose about this.

> Shouldn't we document how oom disable/enable is supposed to be used

Well the API shouldn't be used outside of the PM freezer IMO. This is not a
general API that other parts of the kernel should be using. I can surely
add more documentation for the PM usage though. I have rewritten the
changelog:
"
    As oom_killer_disable() is a full OOM barrier now we can postpone it in
    the PM freezer to later after all freezable user tasks are considered
    frozen (to freeze_kernel_threads).

    Normally there wouldn't be any unfrozen user tasks at this moment so
    the function will not block. But if there was an OOM killer racing with
    try_to_freeze_tasks and the OOM victim didn't finish yet then we have to
    wait for it. This should complete in a finite time, though, because
        - the victim cannot loop in the page fault handler (it would die
          on the way out from the exception)
        - it cannot loop in the page allocator because all further
          allocations would fail
        - it shouldn't be blocked on any locks held by frozen tasks
          (try_to_freeze expects lockless context) and kernel threads and
          work queues are not frozen yet
"

And I've added:
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are dead.
+ *
+ * The function cannot be called when there are runnable user tasks because
+ * the userspace would see unexpected allocation failures as a result. Any
+ * new usage of this function should be consulted with MM people.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
 bool oom_killer_disable(void)

> (it only makes sense while the whole system is in quiescent state)
> and at least trigger WARN_ON_ONCE() if the above code path gets
> triggered while oom killer is disabled?

I can add a WARN_ON(!test_tsk_thread_flag(tsk, TIF_MEMDIE)).

Thanks for the review!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-04 14:16                                                             ` Michal Hocko
@ 2014-12-04 14:44                                                               ` Tejun Heo
  2014-12-04 16:56                                                                 ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-12-04 14:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

On Thu, Dec 04, 2014 at 03:16:23PM +0100, Michal Hocko wrote:
> > A delta but shouldn't it be pr_cont()?
> 
> kernel/power/process.c doesn't use pr_* so I've stayed with what the
> rest of the file is using. I can add a patch which transforms all of
> them.

The console output becomes wrong when printk() is used on
continuation.  So, yeah, it'd be great to fix it.

> > > +extern bool oom_killer_disabled;
> > 
> > Ugh... don't we wanna put this in a header file?
> 
> Who else would need the declaration? This is not something random code
> should look at.

Let's say somebody changes the type to ulong for whatever reason
later and forgets to update this declaration.  What happens then on a
big endian machine?  The one-byte bool load would read the most
significant byte of the ulong, which is zero for any small value.

Jesus, this is basic C programming.  You don't sprinkle external
declarations which the compiler can't verify against the actual
definitions.  There's absolutely no compelling reason to do that here.
Why would you take out compiler verification for no reason?

> > > +void mark_tsk_oom_victim(struct task_struct *tsk)
> > >  {
> > > -	return atomic_read(&oom_kills);
> > > +	BUG_ON(oom_killer_disabled);
> > 
> > WARN_ON_ONCE() is prolly a better option here?
> 
> Well, something fishy is going on when oom_killer_disabled is set and we
> mark a new OOM victim. This is a clear bug. Why would we warn and
> allow the follow-up breakage?

Because the system is more likely to be able to go on and we don't BUG
when we can WARN as a general rule.  Working systems is almost always
better than a dead system even for debugging.
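
IOW (untested):

	if (WARN_ON_ONCE(oom_killer_disabled))
		return;

still leaves a loud trail in the log but lets the system go on.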

> > > +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > 
> > Can a task actually be selected as an OOM victim multiple times?
> 
> AFAICS nothing prevents the global OOM and memcg OOM killers from racing.

Maybe it'd be a good idea to note that in the comment?
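
Something like (feel free to reword):

	/*
	 * A task can be selected as a victim twice, e.g. when a global
	 * OOM races with a memcg OOM on the same task, hence the
	 * test_and_set so that oom_victims is only bumped once per task.
	 */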

> > > -void note_oom_kill(void)
> > > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> > >  {
> > > -	atomic_inc(&oom_kills);
> > > +	int count;
> > > +
> > > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > +		return;
> > 
> > Maybe test this inline in exit_mm()?  e.g.
> > 
> > 	if (test_thread_flag(TIF_MEMDIE))
> > 		unmark_tsk_oom_victim(current);
> 
> Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
> to reduce the usage of the flag as much as possible.

Because it's adding a function call/return to a hot path for everybody.
It sure is a minuscule cost but we're adding that for no good reason.

> > So, each complete() increments the done count and wait decs.  The
> > above code works iff the complete()'s and wait()'s are always balanced
> > which usually isn't true in this type of wait code.  Either use
> > reinit_completion() / complete_all() combos or wait_event().
> 
> Hmm, I thought that only a single instance of freeze_kernel_threads
> (which calls oom_killer_disable) can run at a time. But I am currently
> not sure that all paths are called under lock_system_sleep.
> I am not familiar with reinit_completion API. Is the following correct?

Hmmm... wouldn't wait_event() be easier to read in this case?
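
Something like (untested):

	wait_event(oom_victims_wait, !atomic_read(&oom_victims));

paired with wake_up_all() from the last exiting victim.  wait_event()
rechecks the condition after queueing itself, so the wakeup can't be
lost.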

...
> > Maybe 0 / -errno is better choice as return values?
> 
> I do not have a problem changing this if you feel strongly about it but
> true/false sounds easier to me and it allows the caller to decide what
> to report. If there were multiple reasons to fail then sure, but that
> is not the case.

It's not a big deal but except for functions which have clear boolean
behavior - functions which try/attempt something or query or decide
certain things - randomly thrown in bool returns tend to become
confusing especially because its bool fail value is the opposite of
0/-errno fail value.  So, "this function only fails with one reason"
is usually a bad and arbitrary reason for choosing bool return which
causes confusion on callsites and headaches when the function develops
more reasons to fail.

...
> > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> > >  {
> > >  	struct zonelist *zonelist;
> > >  
> > > +	down_read(&oom_sem);
> > >  	if (mem_cgroup_oom_synchronize(true))
> > > -		return;
> > > +		goto unlock;
> > >  
> > >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> > >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > > -		out_of_memory(NULL, 0, 0, NULL, false);
> > > +		if (!oom_killer_disabled)
> > > +			__out_of_memory(NULL, 0, 0, NULL, false);
> > >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> > 
> > Is this a condition which can happen and we can deal with? With
> > userland fully frozen, there shouldn't be page faults which lead to
> > memory allocation, right?
> 
> Except for racing OOM victims which were missed by try_to_freeze_tasks
> > because they didn't get a CPU slice to wake up from the freezer. The task
> would die on the way out from the page fault exception. I have updated
> the changelog to be more verbose about this.

That's something very not obvious.  Let's please add a comment
explaining that.

> > (it only makes sense while the whole system is in quiescent state)
> > and at least trigger WARN_ON_ONCE() if the above code path gets
> > triggered while oom killer is disabled?
> 
> I can add a WARN_ON(!test_thread_flag(tsk, TIF_MEMDIE)).

Yeah, that makes sense to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-04 14:44                                                               ` Tejun Heo
@ 2014-12-04 16:56                                                                 ` Michal Hocko
  2014-12-04 17:18                                                                   ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-04 16:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

On Thu 04-12-14 09:44:54, Tejun Heo wrote:
> On Thu, Dec 04, 2014 at 03:16:23PM +0100, Michal Hocko wrote:
> > > A delta but shouldn't it be pr_cont()?
> > 
> > kernel/power/process.c doesn't use pr_* so I've stayed with what the
> > rest of the file is using. I can add a patch which transforms all of
> > them.
> 
> The console output becomes wrong when printk() is used on
> continuation.  So, yeah, it'd be great to fix it.
> 
> > > > +extern bool oom_killer_disabled;
> > > 
> > > Ugh... don't we wanna put this in a header file?
> > 
> > Who else would need the declaration? This is not something random code
> > should look at.
> 
> Let's say, somebody changes the type to ulong for whatever reason
> later and forgets to update this declaration.  What happens then on a
> big endian machine?

OK, I see your point. Although this is unlikely...
 
> Jesus, this is basic C programming.  You don't sprinkle external
> declarations which the compiler can't verify against the actual
> definitions.  There's absolutely no compelling reason to do that here.
> Why would you take out compiler verification for no reason?
> 
> > > > +void mark_tsk_oom_victim(struct task_struct *tsk)
> > > >  {
> > > > -	return atomic_read(&oom_kills);
> > > > +	BUG_ON(oom_killer_disabled);
> > > 
> > > WARN_ON_ONCE() is prolly a better option here?
> > 
> > Well, something fishy is going on when oom_killer_disabled is set and we
> > mark a new OOM victim. This is a clear bug. Why would we warn and
> > allow the follow-up breakage?
> 
> Because the system is more likely to be able to go on and we don't BUG
> when we can WARN as a general rule.  Working systems is almost always
> better than a dead system even for debugging.
> 
> > > > +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > 
> > > Can a task actually be selected as an OOM victim multiple times?
> > 
> > AFAICS nothing prevents the global OOM and memcg OOM killers from racing.
> 
> Maybe it'd be a good idea to note that in the comment?

ok

> > > > -void note_oom_kill(void)
> > > > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> > > >  {
> > > > -	atomic_inc(&oom_kills);
> > > > +	int count;
> > > > +
> > > > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > > +		return;
> > > 
> > > Maybe test this inline in exit_mm()?  e.g.
> > > 
> > > 	if (test_thread_flag(TIF_MEMDIE))
> > > 		unmark_tsk_oom_victim(current);
> > 
> > Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
> > to reduce the usage of the flag as much as possible.
> 
> Because it's adding a function call/return to a hot path for everybody.
> It sure is a minuscule cost but we're adding that for no good reason.

ok. 
 
> > > So, each complete() increments the done count and wait decs.  The
> > > above code works iff the complete()'s and wait()'s are always balanced
> > > which usually isn't true in this type of wait code.  Either use
> > > reinit_completion() / complete_all() combos or wait_event().
> > 
> > Hmm, I thought that only a single instance of freeze_kernel_threads
> > (which calls oom_killer_disable) can run at a time. But I am currently
> > not sure that all paths are called under lock_system_sleep.
> > I am not familiar with reinit_completion API. Is the following correct?
> 
> Hmmm... wouldn't wait_event() be easier to read in this case?

OK, it looks easier. I thought it would require some additional
synchronization between wake up and wait but everything necessary seems
to be done in wait_event already so we cannot miss a wake up AFAICS:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1d55ab12792f..032be9d2a239 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -408,7 +408,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
  * Number of OOM victims in flight
  */
 static atomic_t oom_victims = ATOMIC_INIT(0);
-static DECLARE_COMPLETION(oom_victims_wait);
+static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 static DECLARE_RWSEM(oom_sem);
@@ -435,7 +435,7 @@ void unmark_tsk_oom_victim(void)
 	 * is nobody who cares.
 	 */
 	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
-		complete_all(&oom_victims_wait);
+		wake_up_all(&oom_victims_wait);
 	up_read(&oom_sem);
 }
 
@@ -464,16 +464,11 @@ bool oom_killer_disable(void)
 		return false;
 	}
 
-	/* unmark_tsk_oom_victim is calling complete_all */
-	if (!oom_killer_disabled)
-		reinit_completion(&oom_victims_wait);
-
 	oom_killer_disabled = true;
-	count = atomic_read(&oom_victims);
 	up_write(&oom_sem);
 
 	if (count)
-		wait_for_completion(&oom_victims_wait);
+		wait_event(oom_victims_wait, !atomic_read(&oom_victims));
 
 	return true;
 }

> ...
> > > Maybe 0 / -errno is better choice as return values?
> > 
> > I do not have a problem changing this if you feel strongly about it but
> > true/false sounds easier to me and it allows the caller to decide what
> > to report. If there were multiple reasons to fail then sure, but that
> > is not the case.
> 
> It's not a big deal but except for functions which have clear boolean
> behavior - functions which try/attempt something or query or decide

this is basically a try_lock which might fail for whatever internal
reason.

> certain things - randomly thrown in bool returns tend to become
> confusing especially because its bool fail value is the opposite of
> 0/-errno fail value.  So, "this function only fails with one reason"
> is usually a bad and arbitrary reason for choosing bool return which
> causes confusion on callsites and headaches when the function develops
> more reasons to fail.
> 
> ...
> > > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> > > >  {
> > > >  	struct zonelist *zonelist;
> > > >  
> > > > +	down_read(&oom_sem);
> > > >  	if (mem_cgroup_oom_synchronize(true))
> > > > -		return;
> > > > +		goto unlock;
> > > >  
> > > >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> > > >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > > > -		out_of_memory(NULL, 0, 0, NULL, false);
> > > > +		if (!oom_killer_disabled)
> > > > +			__out_of_memory(NULL, 0, 0, NULL, false);
> > > >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> > > 
> > > Is this a condition which can happen and we can deal with? With
> > > userland fully frozen, there shouldn't be page faults which lead to
> > > memory allocation, right?
> > 
> > Except for racing OOM victims which were missed by try_to_freeze_tasks
> > because they didn't get a CPU slice to wake up from the freezer. The task
> > would die on the way out from the page fault exception. I have updated
> > the changelog to be more verbose about this.
> 
> That's something very not obvious.  Let's please add a comment
> explaining that.

@@ -778,6 +795,15 @@ void pagefault_out_of_memory(void)
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
 		if (!oom_killer_disabled)
 			__out_of_memory(NULL, 0, 0, NULL, false);
+		else
+			/*
+			 * There shouldn't be any user tasks runnable while the
+			 * OOM killer is disabled so the current task has to
+			 * be a racing OOM victim for which oom_killer_disable()
+			 * is waiting.
+			 */
+			WARN_ON(!test_thread_flag(TIF_MEMDIE));
+
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
 unlock:

> 
> > > (it only makes sense while the whole system is in quiescent state)
> > > and at least trigger WARN_ON_ONCE() if the above code path gets
> > > triggered while oom killer is disabled?
> > 
> > I can add a WARN_ON(!test_tsk_thread_flag(tsk, TIF_MEMDIE)).
> 
> Yeah, that makes sense to me.
> 
> Thanks.
> 
> -- 
> tejun

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-04 16:56                                                                 ` Michal Hocko
@ 2014-12-04 17:18                                                                   ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-04 17:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: LKML, linux-mm, linux-pm, Andrew Morton,
	\"Rafael J. Wysocki\",
	David Rientjes, Oleg Nesterov, Cong Wang

On Thu 04-12-14 17:56:01, Michal Hocko wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1d55ab12792f..032be9d2a239 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -408,7 +408,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>   * Number of OOM victims in flight
>   */
>  static atomic_t oom_victims = ATOMIC_INIT(0);
> -static DECLARE_COMPLETION(oom_victims_wait);
> +static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
>  
>  bool oom_killer_disabled __read_mostly;
>  static DECLARE_RWSEM(oom_sem);
> @@ -435,7 +435,7 @@ void unmark_tsk_oom_victim(void)
>  	 * is nobody who cares.
>  	 */
>  	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> -		complete_all(&oom_victims_wait);
> +		wake_up_all(&oom_victims_wait);
>  	up_read(&oom_sem);
>  }
>  
> @@ -464,16 +464,11 @@ bool oom_killer_disable(void)
>  		return false;
>  	}
>  
> -	/* unmark_tsk_oom_victim is calling complete_all */
> -	if (!oom_killer_disabled)
> -		reinit_completion(&oom_victims_wait);
> -
>  	oom_killer_disabled = true;
> -	count = atomic_read(&oom_victims);
>  	up_write(&oom_sem);
>  
>  	if (count)

without this count test, obviously

> -		wait_for_completion(&oom_victims_wait);
> +		wait_event(oom_victims_wait, !atomic_read(&oom_victims));
>  
>  	return true;
>  }
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH 0/4] OOM vs PM freezer fixes
  2014-11-10 16:30                                               ` Michal Hocko
  2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
@ 2014-12-05 16:41                                                 ` Michal Hocko
  2014-12-05 16:41                                                   ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko
                                                                     ` (5 more replies)
  1 sibling, 6 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

Hi,
here is another take at OOM vs. PM freezer interaction fixes/cleanups.
The first three patches fix unlikely cases where the OOM killer races
with the PM freezer; those races should finally be closed completely.
The last patch is a simple code enhancement which is, strictly speaking,
not needed but is nice to have IMO.

Both the OOM killer and the PM freezer are quite subtle so I hope I
haven't missed anything. Any feedback is highly appreciated. I am also
interested in feedback on the chosen approach. To be honest I am not
really happy about spreading TIF_MEMDIE checks into the freezer (patch 1)
but I didn't find any other way of detecting OOM killed tasks.

Changes are based on top of Linus tree (3.18-rc3).

Michal Hocko (4):
      OOM, PM: Do not miss OOM killed frozen tasks
      OOM, PM: make OOM detection in the freezer path raceless
      OOM, PM: handle pm freezer as an OOM victim correctly
      OOM: thaw the OOM victim if it is frozen

Diffstat says:
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 39 ++++++++++++++++------
 kernel/freezer.c       | 15 +++++++--
 kernel/power/process.c | 60 +++++++++-------------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 89 ++++++++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c        | 32 +++++++++---------
 7 files changed, 147 insertions(+), 98 deletions(-)


^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
@ 2014-12-05 16:41                                                   ` Michal Hocko
  2014-12-06 12:56                                                     ` Tejun Heo
  2015-01-07 17:57                                                     ` Tejun Heo
  2014-12-05 16:41                                                   ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko
                                                                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

This patch is just a preparatory and it doesn't introduce any functional
change.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h |  4 ++++
 kernel/exit.c       |  2 +-
 mm/memcontrol.c     |  2 +-
 mm/oom_kill.c       | 23 ++++++++++++++++++++---
 4 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4971874f54db..1315fcbb9527 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -47,6 +47,10 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
 }
 
+extern void mark_tsk_oom_victim(struct task_struct *tsk);
+
+extern void unmark_tsk_oom_victim(void);
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019ff953..ee5176e2a1ba 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -459,7 +459,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	clear_thread_flag(TIF_MEMDIE);
+	unmark_tsk_oom_victim();
 }
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6ac0e33e150..302e0fc6d121 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1735,7 +1735,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..c75b37d59a32 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -421,6 +421,23 @@ void note_oom_kill(void)
 	atomic_inc(&oom_kills);
 }
 
+/**
+ * Marks the given task as OOM victim.
+ * @tsk: task to mark
+ */
+void mark_tsk_oom_victim(struct task_struct *tsk)
+{
+	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+}
+
+/**
+ * Unmarks the current task as OOM victim.
+ */
+void unmark_tsk_oom_victim(void)
+{
+	clear_thread_flag(TIF_MEMDIE);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -444,7 +461,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
+		mark_tsk_oom_victim(p);
 		put_task_struct(p);
 		return;
 	}
@@ -527,7 +544,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	rcu_read_unlock();
 
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	mark_tsk_oom_victim(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
@@ -650,7 +667,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
  2014-12-05 16:41                                                   ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko
@ 2014-12-05 16:41                                                   ` Michal Hocko
  2014-12-06 13:06                                                     ` Tejun Heo
  2014-12-05 16:41                                                   ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko
                                                                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the
victim. This is basically a noop when the task is frozen though because
the task sleeps in uninterruptible sleep. The victim is eventually
thawed later when oom_scan_process_thread meets the task again in a
later OOM invocation so the OOM killer doesn't live lock. But this is
less than optimal. Let's add the frozen check and thaw the task right
before we send SIGKILL to the victim.
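
For illustration, the refrigerator loop the victim sits in has roughly
the following shape (a simplified sketch, not the exact kernel/freezer.c
code), which is why the SIGKILL alone cannot wake it up and an explicit
wake up via __thaw_task is needed:

	/* sleeping in TASK_UNINTERRUPTIBLE means signals are not delivered */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!frozen(current))
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);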

The check and thawing in oom_scan_process_thread has to stay because the
task might have got access to memory reserves even without an explicit
SIGKILL from oom_kill_process (e.g. it already has a fatal signal pending
or it is already exiting).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/oom_kill.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c75b37d59a32..8874058d62db 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -545,6 +545,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	rcu_read_unlock();
 
 	mark_tsk_oom_victim(victim);
+	if (frozen(victim))
+		__thaw_task(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH -v2 3/5] PM: convert printk to pr_* equivalent
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
  2014-12-05 16:41                                                   ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko
  2014-12-05 16:41                                                   ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko
@ 2014-12-05 16:41                                                   ` Michal Hocko
  2014-12-05 22:40                                                     ` Rafael J. Wysocki
  2014-12-06 13:08                                                     ` Tejun Heo
  2014-12-05 16:41                                                   ` [PATCH -v2 4/5] sysrq: " Michal Hocko
                                                                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

While touching this area let's convert printk to pr_*. This also makes
continuation lines print properly.
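
The pattern this enables, using the freeze_processes() messages below as
the example:

	pr_info("Freezing user space processes ... ");	/* opens the line */
	pr_cont("done.");				/* appends to it */
	pr_cont("\n");					/* and closes it */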

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 kernel/power/process.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..3ac45f192e9f 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only)
 	elapsed_msecs = elapsed_msecs64;
 
 	if (todo) {
-		printk("\n");
-		printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds "
+		pr_cont("\n");
+		pr_err("Freezing of tasks %s after %d.%03d seconds "
 		       "(%d tasks refusing to freeze, wq_busy=%d):\n",
 		       wakeup ? "aborted" : "failed",
 		       elapsed_msecs / 1000, elapsed_msecs % 1000,
@@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only)
 			read_unlock(&tasklist_lock);
 		}
 	} else {
-		printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
+		pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
 			elapsed_msecs % 1000);
 	}
 
@@ -155,7 +155,7 @@ int freeze_processes(void)
 		atomic_inc(&system_freezing_cnt);
 
 	pm_wakeup_clear();
-	printk("Freezing user space processes ... ");
+	pr_info("Freezing user space processes ... ");
 	pm_freezing = true;
 	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
@@ -171,13 +171,13 @@ int freeze_processes(void)
 		if (oom_kills_count() != oom_kills_saved &&
 		    !check_frozen_processes()) {
 			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
+			pr_cont("OOM in progress.");
 			error = -EBUSY;
 		} else {
-			printk("done.");
+			pr_cont("done.");
 		}
 	}
-	printk("\n");
+	pr_cont("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
@@ -197,13 +197,14 @@ int freeze_kernel_threads(void)
 {
 	int error;
 
-	printk("Freezing remaining freezable tasks ... ");
+	pr_info("Freezing remaining freezable tasks ... ");
+
 	pm_nosig_freezing = true;
 	error = try_to_freeze_tasks(false);
 	if (!error)
-		printk("done.");
+		pr_cont("done.");
 
-	printk("\n");
+	pr_cont("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
@@ -224,7 +225,7 @@ void thaw_processes(void)
 
 	oom_killer_enable();
 
-	printk("Restarting tasks ... ");
+	pr_info("Restarting tasks ... ");
 
 	__usermodehelper_set_disable_depth(UMH_FREEZING);
 	thaw_workqueues();
@@ -243,7 +244,7 @@ void thaw_processes(void)
 	usermodehelper_enable();
 
 	schedule();
-	printk("done.\n");
+	pr_cont("done.\n");
 	trace_suspend_resume(TPS("thaw_processes"), 0, false);
 }
 
@@ -252,7 +253,7 @@ void thaw_kernel_threads(void)
 	struct task_struct *g, *p;
 
 	pm_nosig_freezing = false;
-	printk("Restarting kernel threads ... ");
+	pr_info("Restarting kernel threads ... ");
 
 	thaw_workqueues();
 
@@ -264,5 +265,5 @@ void thaw_kernel_threads(void)
 	read_unlock(&tasklist_lock);
 
 	schedule();
-	printk("done.\n");
+	pr_cont("done.\n");
 }
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH -v2 4/5] sysrq: convert printk to pr_* equivalent
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
                                                                     ` (2 preceding siblings ...)
  2014-12-05 16:41                                                   ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko
@ 2014-12-05 16:41                                                   ` Michal Hocko
  2014-12-06 13:09                                                     ` Tejun Heo
  2014-12-05 16:41                                                   ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
  2014-12-07 10:09                                                   ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
  5 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

While touching this area let's convert printk to pr_*. This also makes
continuation lines print properly.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..0071469ecbf1 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -90,7 +90,7 @@ static void sysrq_handle_loglevel(int key)
 
 	i = key - '0';
 	console_loglevel = CONSOLE_LOGLEVEL_DEFAULT;
-	printk("Loglevel set to %d\n", i);
+	pr_info("Loglevel set to %d\n", i);
 	console_loglevel = i;
 }
 static struct sysrq_key_op sysrq_loglevel_op = {
@@ -220,7 +220,7 @@ static void showacpu(void *dummy)
 		return;
 
 	spin_lock_irqsave(&show_lock, flags);
-	printk(KERN_INFO "CPU%d:\n", smp_processor_id());
+	pr_info("CPU%d:\n", smp_processor_id());
 	show_stack(NULL, NULL);
 	spin_unlock_irqrestore(&show_lock, flags);
 }
@@ -243,7 +243,7 @@ static void sysrq_handle_showallcpus(int key)
 		struct pt_regs *regs = get_irq_regs();
 
 		if (regs) {
-			printk(KERN_INFO "CPU%d:\n", smp_processor_id());
+			pr_info("CPU%d:\n", smp_processor_id());
 			show_regs(regs);
 		}
 		schedule_work(&sysrq_showallcpus);
@@ -522,7 +522,7 @@ void __handle_sysrq(int key, bool check_mask)
 	 */
 	orig_log_level = console_loglevel;
 	console_loglevel = CONSOLE_LOGLEVEL_DEFAULT;
-	printk(KERN_INFO "SysRq : ");
+	pr_info("SysRq : ");
 
         op_p = __sysrq_get_key_op(key);
         if (op_p) {
@@ -531,14 +531,14 @@ void __handle_sysrq(int key, bool check_mask)
 		 * should not) and is the invoked operation enabled?
 		 */
 		if (!check_mask || sysrq_on_mask(op_p->enable_mask)) {
-			printk("%s\n", op_p->action_msg);
+			pr_cont("%s\n", op_p->action_msg);
 			console_loglevel = orig_log_level;
 			op_p->handler(key);
 		} else {
-			printk("This sysrq operation is disabled.\n");
+			pr_cont("This sysrq operation is disabled.\n");
 		}
 	} else {
-		printk("HELP : ");
+		pr_cont("HELP : ");
 		/* Only print the help msg once per handler */
 		for (i = 0; i < ARRAY_SIZE(sysrq_key_table); i++) {
 			if (sysrq_key_table[i]) {
@@ -549,10 +549,10 @@ void __handle_sysrq(int key, bool check_mask)
 					;
 				if (j != i)
 					continue;
-				printk("%s ", sysrq_key_table[i]->help_msg);
+				pr_cont("%s ", sysrq_key_table[i]->help_msg);
 			}
 		}
-		printk("\n");
+		pr_cont("\n");
 		console_loglevel = orig_log_level;
 	}
 	rcu_read_unlock();
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
                                                                     ` (3 preceding siblings ...)
  2014-12-05 16:41                                                   ` [PATCH -v2 4/5] sysrq: " Michal Hocko
@ 2014-12-05 16:41                                                   ` Michal Hocko
  2014-12-06 13:11                                                     ` Tejun Heo
                                                                       ` (2 more replies)
  2014-12-07 10:09                                                   ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
  5 siblings, 3 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-05 16:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window when the OOM killer manages to note_oom_kill after
freeze_processes checks the counter. The race window is quite small and
really unlikely and a partial solution was deemed sufficient at the time
of submission.

Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires full exclusion between the OOM killer and
the freezer's task freezing. This is done by this patch, which introduces
an oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier.

oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.

oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations. Then it disables all further
OOMs by setting oom_killer_disabled and checks for any OOM
victims. Victims are counted via {un}mark_tsk_oom_victim. The last
victim wakes up all waiters on the oom_victims_wait waitqueue enqueued by
oom_killer_disable(). Therefore this function acts as the full OOM
barrier.
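
Condensed into one place, the two sides of the barrier look like this
(a simplified sketch of the code added below; the TIF_MEMDIE check and
error handling are left out):

	/* oom_killer_disable() side */
	down_write(&oom_sem);		/* waits for running OOM invocations */
	oom_killer_disabled = true;
	up_write(&oom_sem);
	wait_event(oom_victims_wait, !atomic_read(&oom_victims));

	/* unmark_tsk_oom_victim() side, called when a victim exits */
	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
		wake_up_all(&oom_victims_wait);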

The page fault path is covered now as well although it was assumed to be
safe before. As per Tejun, "We used to have freezing points deep in file
system code which may be reachable from page fault." so it would be
better and more robust to not rely on freezing points here. Same applies
to the memcg OOM killer.

out_of_memory tells the caller whether the OOM killer was allowed to
trigger and the callers are supposed to handle the situation. The page
allocation path simply fails the allocation the same as before. The page
fault path will retry the fault (more on that later) and the Sysrq OOM
trigger will simply complain to the log.
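
E.g. the Sysrq trigger below ends up doing just:

	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
			   GFP_KERNEL, 0, NULL, true))
		pr_info("OOM request ignored because killer is disabled\n");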

As oom_killer_disable() is a full OOM barrier now we can postpone it in
the PM freezer until after all freezable user tasks are considered frozen
(i.e. to freeze_kernel_threads).

Normally there wouldn't be any unfrozen user tasks at this moment so
the function will not block. But if there was an OOM killer racing with
try_to_freeze_tasks and the OOM victim hasn't finished yet then we have to
wait for it. This should complete in a finite time, though, because
	- the victim cannot loop in the page fault handler (it would die
	  on the way out from the exception)
	- it cannot loop in the page allocator because all further
	  allocations would fail and __GFP_NOFAIL allocations are not
	  acceptable at this stage
	- it shouldn't be blocked on any locks held by frozen tasks
	  (try_to_freeze expects lockless context) and kernel threads and
	  work queues are not frozen yet

TODO:
Android lowmemory killer abuses TIF_MEMDIE in lowmem_scan and it has to
learn about oom_disable logic as well.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c    |   5 +-
 include/linux/oom.h    |  14 ++----
 kernel/exit.c          |   3 +-
 kernel/power/process.c |  58 ++++++----------------
 mm/memcontrol.c        |   2 +-
 mm/oom_kill.c          | 131 ++++++++++++++++++++++++++++++++++++++++++-------
 mm/page_alloc.c        |  17 +------
 7 files changed, 137 insertions(+), 93 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 0071469ecbf1..259a4d5a4e8f 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,9 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true))
+		pr_info("OOM request ignored because killer is disabled\n");
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1315fcbb9527..03b5c395e514 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,22 +72,14 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
 extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
-
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+extern bool oom_killer_disable(void);
+extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index ee5176e2a1ba..272915fc603f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -459,7 +459,8 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	unmark_tsk_oom_victim();
+	if (test_thread_flag(TIF_MEMDIE))
+		unmark_tsk_oom_victim();
 }
 
 /*
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 3ac45f192e9f..c3da8b297b10 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,25 +132,10 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	pr_info("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			pr_cont("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			pr_cont("done.");
-		}
+		pr_cont("done.");
 	}
 	pr_cont("\n");
 	BUG_ON(in_atomic());
@@ -197,8 +157,17 @@ int freeze_kernel_threads(void)
 {
 	int error;
 
-	pr_info("Freezing remaining freezable tasks ... ");
+	/*
+	 * Now that the whole userspace is frozen we need to disable
+	 * the OOM killer to disallow any further interference with
+	 * killable tasks.
+	 */
+	if (!oom_killer_disable()) {
+		error = -EBUSY;
+		goto out;
+	}
 
+	pr_info("Freezing remaining freezable tasks ... ");
 	pm_nosig_freezing = true;
 	error = try_to_freeze_tasks(false);
 	if (!error)
@@ -207,6 +176,7 @@ int freeze_kernel_threads(void)
 	pr_cont("\n");
 	BUG_ON(in_atomic());
 
+out:
 	if (error)
 		thaw_kernel_threads();
 	return error;
@@ -223,8 +193,6 @@ void thaw_processes(void)
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
-	oom_killer_enable();
-
 	pr_info("Restarting tasks ... ");
 
 	__usermodehelper_set_disable_depth(UMH_FREEZING);
@@ -252,6 +220,8 @@ void thaw_kernel_threads(void)
 {
 	struct task_struct *g, *p;
 
+	oom_killer_enable();
+
 	pm_nosig_freezing = false;
 	pr_info("Restarting kernel threads ... ");
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 302e0fc6d121..34a196eb45cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2155,7 +2155,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
-	if (!handle)
+	if (!handle || oom_killer_disabled)
 		goto cleanup;
 
 	owait.memcg = memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8874058d62db..facc4587daf3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,37 +405,91 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 /*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
+ * Number of OOM victims in flight
  */
-static atomic_t oom_kills = ATOMIC_INIT(0);
+static atomic_t oom_victims = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
+bool oom_killer_disabled __read_mostly;
+static DECLARE_RWSEM(oom_sem);
 
 /**
  * Marks the given task as OOM victim.
  * @tsk: task to mark
+ *
+ * Has to be called with oom_sem taken for read and never after
+ * the OOM killer has been disabled.
  */
 void mark_tsk_oom_victim(struct task_struct *tsk)
 {
-	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+	WARN_ON(oom_killer_disabled);
+	/* OOM killer might race with memcg OOM */
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+	atomic_inc(&oom_victims);
 }
 
 /**
  * Unmarks the current task as OOM victim.
+ *
+ * Wakes up all waiters in oom_killer_disable()
  */
 void unmark_tsk_oom_victim(void)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	if (!test_and_clear_thread_flag(TIF_MEMDIE))
+		return;
+
+	down_read(&oom_sem);
+	/*
+	 * There is no need to signal the last oom_victim if there
+	 * is nobody who cares.
+	 */
+	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+		wake_up_all(&oom_victims_wait);
+	up_read(&oom_sem);
+}
+
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are killed.
+ *
+ * The function cannot be called when there are runnable user tasks because
+ * the userspace would see unexpected allocation failures as a result. Any
+ * new usage of this function should be discussed with MM people.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
+bool oom_killer_disable(void)
+{
+	/*
+	 * Make sure to not race with an ongoing OOM killer
+	 * and that the current is not the victim.
+	 */
+	down_write(&oom_sem);
+	if (test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		return false;
+	}
+
+	oom_killer_disabled = true;
+	up_write(&oom_sem);
+
+	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
+
+	return true;
+}
+
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+void oom_killer_enable(void)
+{
+	down_write(&oom_sem);
+	oom_killer_disabled = false;
+	up_write(&oom_sem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -635,7 +689,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -647,7 +701,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -712,6 +766,32 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/**
+ * out_of_memory -  tries to invoke OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * invokes __out_of_memory if the OOM killer has not been disabled by
+ * oom_killer_disable() and returns true. Otherwise returns false.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	bool ret = false;
+
+	down_read(&oom_sem);
+	if (!oom_killer_disabled) {
+		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+		ret = true;
+	}
+	up_read(&oom_sem);
+
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -721,12 +801,25 @@ void pagefault_out_of_memory(void)
 {
 	struct zonelist *zonelist;
 
+	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		return;
+		goto unlock;
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		if (!oom_killer_disabled)
+			__out_of_memory(NULL, 0, 0, NULL, false);
+		else
+			/*
+			 * There shouldn't be any user tasks runnable while the
+			 * OOM killer is disabled so the current task has to
+			 * be a racing OOM victim which oom_killer_disable()
+			 * is waiting for.
+			 */
+			WARN_ON(test_thread_flag(TIF_MEMDIE));
+
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
+unlock:
+	up_read(&oom_sem);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 721780ce1fd3..5b87346837dd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2247,9 +2245,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 
 	*did_some_progress = 0;
 
-	if (oom_killer_disabled)
-		return NULL;
-
 	/*
 	 * Acquire the per-zone oom lock for each zone.  If that
 	 * fails, somebody else is making progress for us.
@@ -2261,14 +2256,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2304,8 +2291,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-	*did_some_progress = 1;
+	if (out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*did_some_progress = 1;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 3/5] PM: convert printk to pr_* equivalent
  2014-12-05 16:41                                                   ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko
@ 2014-12-05 22:40                                                     ` Rafael J. Wysocki
  2014-12-07 10:26                                                       ` Michal Hocko
  2014-12-06 13:08                                                     ` Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Rafael J. Wysocki @ 2014-12-05 22:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Tejun Heo, David Rientjes,
	Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm

On Friday, December 05, 2014 05:41:45 PM Michal Hocko wrote:
> While touching this area let's convert printk to pr_*. This also makes
> continuation lines print properly.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

This is fine by me.

Please let me know if you want me to take it.  Otherwise, please feel free to
push it through a different tree.

> ---
>  kernel/power/process.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 5a6ec8678b9a..3ac45f192e9f 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only)
>  	elapsed_msecs = elapsed_msecs64;
>  
>  	if (todo) {
> -		printk("\n");
> -		printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds "
> +		pr_cont("\n");
> +		pr_err("Freezing of tasks %s after %d.%03d seconds "
>  		       "(%d tasks refusing to freeze, wq_busy=%d):\n",
>  		       wakeup ? "aborted" : "failed",
>  		       elapsed_msecs / 1000, elapsed_msecs % 1000,
> @@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only)
>  			read_unlock(&tasklist_lock);
>  		}
>  	} else {
> -		printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
> +		pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
>  			elapsed_msecs % 1000);
>  	}
>  
> @@ -155,7 +155,7 @@ int freeze_processes(void)
>  		atomic_inc(&system_freezing_cnt);
>  
>  	pm_wakeup_clear();
> -	printk("Freezing user space processes ... ");
> +	pr_info("Freezing user space processes ... ");
>  	pm_freezing = true;
>  	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
> @@ -171,13 +171,13 @@ int freeze_processes(void)
>  		if (oom_kills_count() != oom_kills_saved &&
>  		    !check_frozen_processes()) {
>  			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> +			pr_cont("OOM in progress.");
>  			error = -EBUSY;
>  		} else {
> -			printk("done.");
> +			pr_cont("done.");
>  		}
>  	}
> -	printk("\n");
> +	pr_cont("\n");
>  	BUG_ON(in_atomic());
>  
>  	if (error)
> @@ -197,13 +197,14 @@ int freeze_kernel_threads(void)
>  {
>  	int error;
>  
> -	printk("Freezing remaining freezable tasks ... ");
> +	pr_info("Freezing remaining freezable tasks ... ");
> +
>  	pm_nosig_freezing = true;
>  	error = try_to_freeze_tasks(false);
>  	if (!error)
> -		printk("done.");
> +		pr_cont("done.");
>  
> -	printk("\n");
> +	pr_cont("\n");
>  	BUG_ON(in_atomic());
>  
>  	if (error)
> @@ -224,7 +225,7 @@ void thaw_processes(void)
>  
>  	oom_killer_enable();
>  
> -	printk("Restarting tasks ... ");
> +	pr_info("Restarting tasks ... ");
>  
>  	__usermodehelper_set_disable_depth(UMH_FREEZING);
>  	thaw_workqueues();
> @@ -243,7 +244,7 @@ void thaw_processes(void)
>  	usermodehelper_enable();
>  
>  	schedule();
> -	printk("done.\n");
> +	pr_cont("done.\n");
>  	trace_suspend_resume(TPS("thaw_processes"), 0, false);
>  }
>  
> @@ -252,7 +253,7 @@ void thaw_kernel_threads(void)
>  	struct task_struct *g, *p;
>  
>  	pm_nosig_freezing = false;
> -	printk("Restarting kernel threads ... ");
> +	pr_info("Restarting kernel threads ... ");
>  
>  	thaw_workqueues();
>  
> @@ -264,5 +265,5 @@ void thaw_kernel_threads(void)
>  	read_unlock(&tasklist_lock);
>  
>  	schedule();
> -	printk("done.\n");
> +	pr_cont("done.\n");
>  }
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE
  2014-12-05 16:41                                                   ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko
@ 2014-12-06 12:56                                                     ` Tejun Heo
  2014-12-07 10:13                                                       ` Michal Hocko
  2015-01-07 17:57                                                     ` Tejun Heo
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-12-06 12:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote:
> +/**
> + * Marks the given task as OOM victim.

/**
 * $FUNCTION_NAME - $DESCRIPTION

> + * @tsk: task to mark
> + */
> +void mark_tsk_oom_victim(struct task_struct *tsk)
> +{
> +	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +}
> +
> +/**
> + * Unmarks the current task as OOM victim.

Ditto.
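
IOW something like:

/**
 * mark_tsk_oom_victim - marks the given task as OOM victim
 * @tsk: task to mark
 */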

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen
  2014-12-05 16:41                                                   ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko
@ 2014-12-06 13:06                                                     ` Tejun Heo
  2014-12-07 10:24                                                       ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-12-06 13:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

Hello,

On Fri, Dec 05, 2014 at 05:41:44PM +0100, Michal Hocko wrote:
> oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the
> victim. This is basically a noop when the task is frozen though because
> the task sleeps in uninterruptible sleep. The victim is eventually
> thawed later when oom_scan_process_thread meets the task again in a
> later OOM invocation so the OOM killer doesn't live lock. But this is
> less than optimal. Let's add the frozen check and thaw the task right
> before we send SIGKILL to the victim.
> 
> The check and thawing in oom_scan_process_thread has to stay because the
> task might have got access to memory reserves even without an explicit
> SIGKILL from oom_kill_process (e.g. it already has a fatal signal pending
> or it is already exiting).

How else would a task get TIF_MEMDIE?  If there are other paths which
set TIF_MEMDIE, the right thing to do is to create a function which
thaws / wakes up the target task and use it there too.  Please
interlock these things properly from the get-go instead of scattering
these things around.

> @@ -545,6 +545,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	rcu_read_unlock();
>  
>  	mark_tsk_oom_victim(victim);
> +	if (frozen(victim))
> +		__thaw_task(victim);

The frozen() test here is racy.  Always calling __thaw_task() wouldn't
be.  You can argue that being racy here is okay because the later
scanning would find it but why complicate things like that?  Just
properly interlock each instance and be done with it.
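
IOW:

	/* racy: the victim can change state between the check and the act */
	if (frozen(victim))
		__thaw_task(victim);

	/* not racy: __thaw_task() rechecks frozen() under freezer_lock */
	__thaw_task(victim);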

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 3/5] PM: convert printk to pr_* equivalent
  2014-12-05 16:41                                                   ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko
  2014-12-05 22:40                                                     ` Rafael J. Wysocki
@ 2014-12-06 13:08                                                     ` Tejun Heo
  1 sibling, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2014-12-06 13:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Fri, Dec 05, 2014 at 05:41:45PM +0100, Michal Hocko wrote:
> While touching this area let's convert printk to pr_*. This also makes
> continuation lines print properly.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 4/5] sysrq: convert printk to pr_* equivalent
  2014-12-05 16:41                                                   ` [PATCH -v2 4/5] sysrq: " Michal Hocko
@ 2014-12-06 13:09                                                     ` Tejun Heo
  0 siblings, 0 replies; 93+ messages in thread
From: Tejun Heo @ 2014-12-06 13:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Fri, Dec 05, 2014 at 05:41:46PM +0100, Michal Hocko wrote:
> While touching this area let's convert printk to pr_*. This also makes
> continuation lines print properly.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-05 16:41                                                   ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
@ 2014-12-06 13:11                                                     ` Tejun Heo
  2014-12-07 10:11                                                       ` Michal Hocko
  2015-01-07 18:41                                                     ` Tejun Heo
  2015-01-08 11:51                                                     ` Michal Hocko
  2 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-12-06 13:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote:
> 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
> has left a race window when the OOM killer manages to note_oom_kill after
> freeze_processes checks the counter. The race window is quite small and
> really unlikely and a partial solution was deemed sufficient at the time
> of submission.

This patch doesn't apply on top of v3.18-rc3, latest mainline, -mm or
-next.  Did I miss something?  Can you please check the patch?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/4] OOM vs PM freezer fixes
  2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
                                                                     ` (4 preceding siblings ...)
  2014-12-05 16:41                                                   ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
@ 2014-12-07 10:09                                                   ` Michal Hocko
  2014-12-07 13:55                                                     ` Tejun Heo
  5 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 10:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

For some reason the cover letter that went out was a previous version. I
had some issues with git send-email which was failing for me. Anyway, this
is the correct cover letter. Sorry about the confusion.

Hi,
this is another attempt to address OOM vs. PM interaction. More
about the issue is described in the last patch. The other 4 patches
are just clean ups. This is based on top of 3.18-rc3 + Johannes'
http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
Andrew's tree yet but I wanted to prevent later merge conflicts.

The previous version of the main patch (5th one) was posted here:
http://marc.info/?l=linux-mm&m=141634503316543&w=2. This version has
hopefully addressed all the points raised by Tejun in the previous
version. Namely
	- checkpatch fixes + printk -> pr_* changes in the respective
	  areas
	- more comments added to clarify subtle interactions
	- oom_killer_disable(), unmark_tsk_oom_victim changed to use the
	  wait_event API which is easier to use

Both the OOM killer and the PM freezer are really subtle so I would really
appreciate a thorough review here. I still haven't changed the lowmemory
killer, which is abusing TIF_MEMDIE and would break this code
(the oom_victims counter balance), and I plan to look at it as soon as the
rest of the series is OK and agreed on as the way to go. So there will
be at least one more patch for the final submission.

Thanks!

Michal Hocko (5):
      oom: add helpers for setting and clearing TIF_MEMDIE
      OOM: thaw the OOM victim if it is frozen
      PM: convert printk to pr_* equivalent
      sysrq: convert printk to pr_* equivalent
      OOM, PM: make OOM detection in the freezer path raceless

And diffstat:
 drivers/tty/sysrq.c    |  23 ++++----
 include/linux/oom.h    |  18 +++----
 kernel/exit.c          |   3 +-
 kernel/power/process.c |  81 +++++++++-------------------
 mm/memcontrol.c        |   4 +-
 mm/oom_kill.c          | 142 +++++++++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c        |  17 +-----
 7 files changed, 178 insertions(+), 110 deletions(-)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-06 13:11                                                     ` Tejun Heo
@ 2014-12-07 10:11                                                       ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 10:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sat 06-12-14 08:11:15, Tejun Heo wrote:
> On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote:
> > 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
> > has left a race window when the OOM killer manages to note_oom_kill after
> > freeze_processes checks the counter. The race window is quite small and
> > really unlikely and a partial solution was deemed sufficient at the time
> > of submission.
> 
> This patch doesn't apply on top of v3.18-rc3, latest mainline, -mm or
> -next.  Did I miss something?  Can you please check the patch?

The original cover letter, which didn't make it to the mailing list,
mentioned that. I have reposted it now. Anyway this is on top of
http://marc.info/?l=linux-kernel&m=141779091114777 which hadn't landed
in the -mm tree at the time I posted this. Sorry about the confusion.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE
  2014-12-06 12:56                                                     ` Tejun Heo
@ 2014-12-07 10:13                                                       ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 10:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sat 06-12-14 07:56:17, Tejun Heo wrote:
> On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote:
> > +/**
> > + * Marks the given task as OOM victim.
> 
> /**
>  * $FUNCTION_NAME - $DESCRIPTION
> 
> > + * @tsk: task to mark
> > + */
> > +void mark_tsk_oom_victim(struct task_struct *tsk)
> > +{
> > +	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > +}
> > +
> > +/**
> > + * Unmarks the current task as OOM victim.
> 
> Ditto.

Fixed
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen
  2014-12-06 13:06                                                     ` Tejun Heo
@ 2014-12-07 10:24                                                       ` Michal Hocko
  2014-12-07 10:45                                                         ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 10:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sat 06-12-14 08:06:57, Tejun Heo wrote:
> Hello,
> 
> On Fri, Dec 05, 2014 at 05:41:44PM +0100, Michal Hocko wrote:
> > oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the
> > victim. This is basically a noop when the task is frozen though because
> > the task sleeps in uninterruptible sleep. The victim is eventually
> > thawed later when oom_scan_process_thread meets the task again in a
> > later OOM invocation so the OOM killer doesn't live lock. But this is
> > less than optimal. Let's add the frozen check and thaw the task right
> > before we send SIGKILL to the victim.
> > 
> > The check and thawing in oom_scan_process_thread has to stay because the
> > task might have got access to memory reserves even without an explicit
> > SIGKILL from oom_kill_process (e.g. it already has a fatal signal pending
> > or it is already exiting).
> 
> How else would a task get TIF_MEMDIE?  If there are other paths which
> set TIF_MEMDIE, the right thing to do is to create a function which
> thaws / wakes up the target task and use it there too.  Please
> interlock these things properly from the get-go instead of scattering
> these things around.

See __out_of_memory which sets TIF_MEMDIE on current when it is exiting
or has a fatal signal pending. Such a task obviously cannot be frozen.
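
That is this path (quoting patch 1/5):

	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
		mark_tsk_oom_victim(current);
		return;
	}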

> > @@ -545,6 +545,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  	rcu_read_unlock();
> >  
> >  	mark_tsk_oom_victim(victim);
> > +	if (frozen(victim))
> > +		__thaw_task(victim);
> 
> The frozen() test here is racy.  Always calling __thaw_task() wouldn't
> be.  You can argue that being racy here is okay because the later
> scanning would find it but why complicate things like that?  Just
> properly interlock each instance and be done with it.

OK, changed. I didn't realize that __thaw_task does the check already
and was following what we have in oom_scan_process_thread. Removed the
check from that one as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 3/5] PM: convert printk to pr_* equivalent
  2014-12-05 22:40                                                     ` Rafael J. Wysocki
@ 2014-12-07 10:26                                                       ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 10:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-mm, Andrew Morton, Tejun Heo, David Rientjes,
	Johannes Weiner, Oleg Nesterov, Cong Wang, LKML, linux-pm

On Fri 05-12-14 23:40:55, Rafael J. Wysocki wrote:
> On Friday, December 05, 2014 05:41:45 PM Michal Hocko wrote:
> > While touching this area let's convert printk to pr_*. This also makes
> > continuation lines print properly.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> This is fine by me.
> 
> Please let me know if you want me to take it.  Otherwise, please feel free to
> push it through a different tree.

I guess it will be easier to push this through Andrew's tree due to
other dependencies.
 
> > ---
> >  kernel/power/process.c | 29 +++++++++++++++--------------
> >  1 file changed, 15 insertions(+), 14 deletions(-)
> > 
> > diff --git a/kernel/power/process.c b/kernel/power/process.c
> > index 5a6ec8678b9a..3ac45f192e9f 100644
> > --- a/kernel/power/process.c
> > +++ b/kernel/power/process.c
> > @@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only)
> >  	elapsed_msecs = elapsed_msecs64;
> >  
> >  	if (todo) {
> > -		printk("\n");
> > -		printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds "
> > +		pr_cont("\n");
> > +		pr_err("Freezing of tasks %s after %d.%03d seconds "
> >  		       "(%d tasks refusing to freeze, wq_busy=%d):\n",
> >  		       wakeup ? "aborted" : "failed",
> >  		       elapsed_msecs / 1000, elapsed_msecs % 1000,
> > @@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only)
> >  			read_unlock(&tasklist_lock);
> >  		}
> >  	} else {
> > -		printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
> > +		pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
> >  			elapsed_msecs % 1000);
> >  	}
> >  
> > @@ -155,7 +155,7 @@ int freeze_processes(void)
> >  		atomic_inc(&system_freezing_cnt);
> >  
> >  	pm_wakeup_clear();
> > -	printk("Freezing user space processes ... ");
> > +	pr_info("Freezing user space processes ... ");
> >  	pm_freezing = true;
> >  	oom_kills_saved = oom_kills_count();
> >  	error = try_to_freeze_tasks(true);
> > @@ -171,13 +171,13 @@ int freeze_processes(void)
> >  		if (oom_kills_count() != oom_kills_saved &&
> >  		    !check_frozen_processes()) {
> >  			__usermodehelper_set_disable_depth(UMH_ENABLED);
> > -			printk("OOM in progress.");
> > +			pr_cont("OOM in progress.");
> >  			error = -EBUSY;
> >  		} else {
> > -			printk("done.");
> > +			pr_cont("done.");
> >  		}
> >  	}
> > -	printk("\n");
> > +	pr_cont("\n");
> >  	BUG_ON(in_atomic());
> >  
> >  	if (error)
> > @@ -197,13 +197,14 @@ int freeze_kernel_threads(void)
> >  {
> >  	int error;
> >  
> > -	printk("Freezing remaining freezable tasks ... ");
> > +	pr_info("Freezing remaining freezable tasks ... ");
> > +
> >  	pm_nosig_freezing = true;
> >  	error = try_to_freeze_tasks(false);
> >  	if (!error)
> > -		printk("done.");
> > +		pr_cont("done.");
> >  
> > -	printk("\n");
> > +	pr_cont("\n");
> >  	BUG_ON(in_atomic());
> >  
> >  	if (error)
> > @@ -224,7 +225,7 @@ void thaw_processes(void)
> >  
> >  	oom_killer_enable();
> >  
> > -	printk("Restarting tasks ... ");
> > +	pr_info("Restarting tasks ... ");
> >  
> >  	__usermodehelper_set_disable_depth(UMH_FREEZING);
> >  	thaw_workqueues();
> > @@ -243,7 +244,7 @@ void thaw_processes(void)
> >  	usermodehelper_enable();
> >  
> >  	schedule();
> > -	printk("done.\n");
> > +	pr_cont("done.\n");
> >  	trace_suspend_resume(TPS("thaw_processes"), 0, false);
> >  }
> >  
> > @@ -252,7 +253,7 @@ void thaw_kernel_threads(void)
> >  	struct task_struct *g, *p;
> >  
> >  	pm_nosig_freezing = false;
> > -	printk("Restarting kernel threads ... ");
> > +	pr_info("Restarting kernel threads ... ");
> >  
> >  	thaw_workqueues();
> >  
> > @@ -264,5 +265,5 @@ void thaw_kernel_threads(void)
> >  	read_unlock(&tasklist_lock);
> >  
> >  	schedule();
> > -	printk("done.\n");
> > +	pr_cont("done.\n");
> >  }
> > 
> 
> -- 
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen
  2014-12-07 10:24                                                       ` Michal Hocko
@ 2014-12-07 10:45                                                         ` Michal Hocko
  2014-12-07 13:59                                                           ` Tejun Heo
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 10:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sun 07-12-14 11:24:30, Michal Hocko wrote:
> On Sat 06-12-14 08:06:57, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Dec 05, 2014 at 05:41:44PM +0100, Michal Hocko wrote:
> > > oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the
> > > victim. This is basically a noop when the task is frozen though because
> > > the task sleeps in uninterruptible sleep. The victim is eventually
> > > thawed later when oom_scan_process_thread meets the task again in a
> > > later OOM invocation so the OOM killer doesn't live lock. But this is
> > > less than optimal. Let's add the frozen check and thaw the task right
> > > before we send SIGKILL to the victim.
> > > 
> > > The check and thawing in oom_scan_process_thread has to stay because the
> > > task might have got access to memory reserves even without an explicit
> > > SIGKILL from oom_kill_process (e.g. it already has a fatal signal pending
> > > or it is already exiting).
> > 
> > How else would a task get TIF_MEMDIE?  If there are other paths which
> > set TIF_MEMDIE, the right thing to do is to create a function which
> > thaws / wakes up the target task and use it there too.  Please
> > interlock these things properly from the get-go instead of scattering
> > these things around.
> 
> See __out_of_memory which sets TIF_MEMDIE on current when it is exiting
> or has fatal signals pending. This task cannot be frozen obviously.

On the other hand we are doing the same early in oom_kill_process, which
doesn't operate on current. I've moved the __thaw_task call
into mark_tsk_oom_victim so it catches all instances now.
oom_scan_process_thread doesn't need to thaw anymore.
---
From af8222df6c503fa1beab8279ff39a282fd90698b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 12 Nov 2014 18:56:54 +0100
Subject: [PATCH] OOM: thaw the OOM victim if it is frozen

oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the
victim. This is basically a noop when the task is frozen though because
the task sleeps in uninterruptible sleep. The victim is eventually
thawed later when oom_scan_process_thread meets the task again in a
later OOM invocation so the OOM killer doesn't live lock. But this is
less than optimal. Let's call __thaw_task from mark_tsk_oom_victim after
we set TIF_MEMDIE for the victim. We are not checking whether the task is
frozen because that would be racy and __thaw_task does that check already.
oom_scan_process_thread doesn't need to care about the freezer anymore
as TIF_MEMDIE and the freezer are mutually exclusive now.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/oom_kill.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 56eab9621c3a..19a08f3f00ba 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -266,8 +266,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (unlikely(frozen(task)))
-			__thaw_task(task);
 		if (!force_kill)
 			return OOM_SCAN_ABORT;
 	}
@@ -428,6 +426,7 @@ void note_oom_kill(void)
 void mark_tsk_oom_victim(struct task_struct *tsk)
 {
 	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+	__thaw_task(tsk);
 }
 
 /**
-- 
2.1.3


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/4] OOM vs PM freezer fixes
  2014-12-07 10:09                                                   ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
@ 2014-12-07 13:55                                                     ` Tejun Heo
  2014-12-07 19:00                                                       ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-12-07 13:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, \"Rafael J. Wysocki\",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote:
> this is another attempt to address OOM vs. PM interaction. More
> about the issue is described in the last patch. The other 4 patches
> are just clean ups. This is based on top of 3.18-rc3 + Johannes'
> http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
> Andrew's tree yet but I wanted to prevent later merge conflicts.

When the patches are based on a custom tree, it's often a good idea to
create a git branch of the patches to help reviewing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen
  2014-12-07 10:45                                                         ` Michal Hocko
@ 2014-12-07 13:59                                                           ` Tejun Heo
  2014-12-07 18:55                                                             ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2014-12-07 13:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sun, Dec 07, 2014 at 11:45:39AM +0100, Michal Hocko wrote:
....
>  void mark_tsk_oom_victim(struct task_struct *tsk)
>  {
>  	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +	__thaw_task(tsk);

Yeah, this is a lot better.  Maybe we can at least add a comment
pointing readers to where to look to understand what's going on?
This stems from the fact that the OOM killer, which essentially is a
memory reclaim operation, overrides freezing.  It'd be nice if that
were documented somehow.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen
  2014-12-07 13:59                                                           ` Tejun Heo
@ 2014-12-07 18:55                                                             ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 18:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sun 07-12-14 08:59:40, Tejun Heo wrote:
> On Sun, Dec 07, 2014 at 11:45:39AM +0100, Michal Hocko wrote:
> ....
> >  void mark_tsk_oom_victim(struct task_struct *tsk)
> >  {
> >  	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > +	__thaw_task(tsk);
> 
> Yeah, this is a lot better.  Maybe we can at least add a comment
> pointing readers to where to look to understand what's going on?
> This stems from the fact that the OOM killer, which essentially is a
> memory reclaim operation, overrides freezing.  It'd be nice if that
> were documented somehow.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 19a08f3f00ba..fca456fe855a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -426,6 +426,13 @@ void note_oom_kill(void)
 void mark_tsk_oom_victim(struct task_struct *tsk)
 {
 	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+
+	/*
+	 * Make sure that the task is woken up from uninterruptible sleep
+	 * if it is frozen, because the OOM killer could not free any
+	 * memory and would livelock otherwise. freezing_slow_path will
+	 * tell the freezer that TIF_MEMDIE tasks should be ignored.
+	 */
 	__thaw_task(tsk);
 }

Better?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/4] OOM vs PM freezer fixes
  2014-12-07 13:55                                                     ` Tejun Heo
@ 2014-12-07 19:00                                                       ` Michal Hocko
  2014-12-18 16:27                                                         ` Michal Hocko
  0 siblings, 1 reply; 93+ messages in thread
From: Michal Hocko @ 2014-12-07 19:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sun 07-12-14 08:55:51, Tejun Heo wrote:
> On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote:
> > this is another attempt to address the OOM vs. PM interaction. More
> > about the issue is described in the last patch. The other 4 patches
> > are just cleanups. This is based on top of 3.18-rc3 + Johannes'
> > http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
> > Andrew's tree yet, but I wanted to prevent later merge conflicts.
> 
> When the patches are based on a custom tree, it's often a good idea to
> publish a git branch containing the patches to help with review.

git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to-review/make-oom-vs-pm-freezing-more-robust-2
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 0/4] OOM vs PM freezer fixes
  2014-12-07 19:00                                                       ` Michal Hocko
@ 2014-12-18 16:27                                                         ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2014-12-18 16:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Sun 07-12-14 20:00:26, Michal Hocko wrote:
> On Sun 07-12-14 08:55:51, Tejun Heo wrote:
> > On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote:
> > > this is another attempt to address the OOM vs. PM interaction. More
> > > about the issue is described in the last patch. The other 4 patches
> > > are just cleanups. This is based on top of 3.18-rc3 + Johannes'
> > > http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
> > > Andrew's tree yet, but I wanted to prevent later merge conflicts.
> > 
> > When the patches are based on a custom tree, it's often a good idea to
> > publish a git branch containing the patches to help with review.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to-review/make-oom-vs-pm-freezing-more-robust-2

Are there any other concerns? Should I just resubmit (after rc1)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE
  2014-12-05 16:41                                                   ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko
  2014-12-06 12:56                                                     ` Tejun Heo
@ 2015-01-07 17:57                                                     ` Tejun Heo
  2015-01-07 18:23                                                       ` Michal Hocko
  1 sibling, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2015-01-07 17:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote:
> +/**
> + * Unmarks the current task as OOM victim.
> + */
> +void unmark_tsk_oom_victim(void)
> +{
> +	clear_thread_flag(TIF_MEMDIE);
> +}

This probably should be unmark_current_oom_victim()?  Also, can we
please use the full "task" at least in global symbols?  I don't think
the tsk abbreviation is that popular in function names.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE
  2015-01-07 17:57                                                     ` Tejun Heo
@ 2015-01-07 18:23                                                       ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2015-01-07 18:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Wed 07-01-15 12:57:31, Tejun Heo wrote:
> On Fri, Dec 05, 2014 at 05:41:43PM +0100, Michal Hocko wrote:
> > +/**
> > + * Unmarks the current task as OOM victim.
> > + */
> > +void unmark_tsk_oom_victim(void)
> > +{
> > +	clear_thread_flag(TIF_MEMDIE);
> > +}
> 
> This probably should be unmark_current_oom_victim()?

OK.

> Also, can we
> please use the full "task" at least in global symbols?  I don't think
> the tsk abbreviation is that popular in function names.

It is mimicking the *_tsk_thread_flag() API.
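
Roughly, from include/linux/sched.h (a sketch; the exact definitions
may differ in this tree):

	static inline void set_tsk_thread_flag(struct task_struct *tsk, int flag)
	{
		set_ti_thread_flag(task_thread_info(tsk), flag);
	}

	static inline void clear_tsk_thread_flag(struct task_struct *tsk, int flag)
	{
		clear_ti_thread_flag(task_thread_info(tsk), flag);
	}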

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-05 16:41                                                   ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
  2014-12-06 13:11                                                     ` Tejun Heo
@ 2015-01-07 18:41                                                     ` Tejun Heo
  2015-01-07 18:48                                                       ` Michal Hocko
  2015-01-08 11:51                                                     ` Michal Hocko
  2 siblings, 1 reply; 93+ messages in thread
From: Tejun Heo @ 2015-01-07 18:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

Hello, Michal.  Sorry about the long delay.

On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote:
...
> @@ -252,6 +220,8 @@ void thaw_kernel_threads(void)
>  {
>  	struct task_struct *g, *p;
>  
> +	oom_killer_enable();
> +

Wouldn't it be more symmetrical and make more sense to enable the oom
killer after kernel threads are thawed?  Until kernel threads are
thawed, it isn't guaranteed that the oom killer would be able to make
forward progress, right?

Other than that, looks good to me.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless
  2015-01-07 18:41                                                     ` Tejun Heo
@ 2015-01-07 18:48                                                       ` Michal Hocko
  0 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2015-01-07 18:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-mm, Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Wed 07-01-15 13:41:24, Tejun Heo wrote:
> Hello, Michal.  Sorry about the long delay.
> 
> On Fri, Dec 05, 2014 at 05:41:47PM +0100, Michal Hocko wrote:
> ...
> > @@ -252,6 +220,8 @@ void thaw_kernel_threads(void)
> >  {
> >  	struct task_struct *g, *p;
> >  
> > +	oom_killer_enable();
> > +
> 
> Wouldn't it be more symmetrical and make more sense to enable the oom
> killer after kernel threads are thawed?  Until kernel threads are
> thawed, it isn't guaranteed that the oom killer would be able to make
> forward progress, right?

Makes sense, fixed.
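
I.e. something like this (a sketch; the surrounding body of
thaw_kernel_threads() is approximate):

	void thaw_kernel_threads(void)
	{
		struct task_struct *g, *p;

		pm_nosig_freezing = false;

		read_lock(&tasklist_lock);
		for_each_process_thread(g, p) {
			if (p->flags & (PF_KTHREAD | PF_WQ_WORKER))
				__thaw_task(p);
		}
		read_unlock(&tasklist_lock);

		/* kthreads are runnable again so the OOM killer can make progress */
		oom_killer_enable();

		schedule();
	}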

> Other than that, looks good to me.

Thanks! Btw. I plan to repost after Andrew releases a new mmotm as
there are some dependencies in the oom area.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless
  2014-12-05 16:41                                                   ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
  2014-12-06 13:11                                                     ` Tejun Heo
  2015-01-07 18:41                                                     ` Tejun Heo
@ 2015-01-08 11:51                                                     ` Michal Hocko
  2 siblings, 0 replies; 93+ messages in thread
From: Michal Hocko @ 2015-01-08 11:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang, LKML,
	linux-pm

On Fri 05-12-14 17:41:47, Michal Hocko wrote:
[...]
> +bool oom_killer_disable(void)
> +{
> +	/*
> +	 * Make sure to not race with an ongoing OOM killer
> +	 * and that the current task is not the victim.
> +	 */
> +	down_write(&oom_sem);
> +	if (test_thread_flag(TIF_MEMDIE)) {
> +		up_write(&oom_sem);
> +		return false;
> +	}
> +
> +	oom_killer_disabled = true;
> +	up_write(&oom_sem);
> +
> +	wait_event(oom_victims_wait, atomic_read(&oom_victims));

Oops, brainfart... This should be !atomic_read(&oom_victims). The
condition says what we are waiting for, not when we are waiting.
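
I.e. the call should read:

	wait_event(oom_victims_wait, !atomic_read(&oom_victims));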

> +
> +	return true;
> +}
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2015-01-08 11:51 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-21  7:27 [PATCH 0/4 -v2] OOM vs. freezer interaction fixes Michal Hocko
2014-10-21  7:27 ` [PATCH 1/4] freezer: Do not freeze tasks killed by OOM killer Michal Hocko
2014-10-21 12:04   ` Rafael J. Wysocki
2014-10-21  7:27 ` [PATCH 2/4] freezer: remove obsolete comments in __thaw_task() Michal Hocko
2014-10-21 12:04   ` Rafael J. Wysocki
2014-10-21  7:27 ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
2014-10-21 12:09   ` Rafael J. Wysocki
2014-10-21 13:14     ` Michal Hocko
2014-10-21 13:42       ` Rafael J. Wysocki
2014-10-21 14:11         ` Michal Hocko
2014-10-21 14:41           ` Rafael J. Wysocki
2014-10-21 14:29             ` Michal Hocko
2014-10-22 14:39               ` Rafael J. Wysocki
2014-10-22 14:22                 ` Michal Hocko
2014-10-22 21:18                   ` Rafael J. Wysocki
2014-10-26 18:49               ` Pavel Machek
2014-11-04 19:27               ` Tejun Heo
2014-11-05 12:46                 ` Michal Hocko
2014-11-05 13:02                   ` Tejun Heo
2014-11-05 13:31                     ` Michal Hocko
2014-11-05 13:42                       ` Michal Hocko
2014-11-05 14:14                         ` Michal Hocko
2014-11-05 15:45                           ` Michal Hocko
2014-11-05 15:44                         ` Tejun Heo
2014-11-05 16:01                           ` Michal Hocko
2014-11-05 16:29                             ` Tejun Heo
2014-11-05 16:39                               ` Michal Hocko
2014-11-05 16:54                                 ` Tejun Heo
2014-11-05 17:01                                   ` Tejun Heo
2014-11-06 13:05                                     ` Michal Hocko
2014-11-06 15:09                                       ` Tejun Heo
2014-11-06 16:01                                         ` Michal Hocko
2014-11-06 16:12                                           ` Tejun Heo
2014-11-06 16:31                                             ` Michal Hocko
2014-11-06 16:33                                               ` Tejun Heo
2014-11-06 16:58                                                 ` Michal Hocko
2014-11-05 17:46                                   ` Michal Hocko
2014-11-05 17:55                                     ` Tejun Heo
2014-11-06 12:49                                       ` Michal Hocko
2014-11-06 15:01                                         ` Tejun Heo
2014-11-06 16:02                                           ` Michal Hocko
2014-11-06 16:28                                             ` Tejun Heo
2014-11-10 16:30                                               ` Michal Hocko
2014-11-12 18:58                                                 ` [RFC 0/4] OOM vs PM freezer fixes Michal Hocko
2014-11-12 18:58                                                   ` [RFC 1/4] OOM, PM: Do not miss OOM killed frozen tasks Michal Hocko
2014-11-14 17:55                                                     ` Tejun Heo
2014-11-12 18:58                                                   ` [RFC 2/4] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
2014-11-12 18:58                                                   ` [RFC 3/4] OOM, PM: handle pm freezer as an OOM victim correctly Michal Hocko
2014-11-12 18:58                                                   ` [RFC 4/4] OOM: thaw the OOM victim if it is frozen Michal Hocko
2014-11-14 20:14                                                   ` [RFC 0/4] OOM vs PM freezer fixes Tejun Heo
2014-11-18 21:08                                                     ` Michal Hocko
2014-11-18 21:10                                                       ` [RFC 1/2] oom: add helper for setting and clearing TIF_MEMDIE Michal Hocko
2014-11-18 21:10                                                         ` [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
2014-11-27  0:47                                                           ` Rafael J. Wysocki
2014-12-02 22:08                                                           ` Tejun Heo
2014-12-04 14:16                                                             ` Michal Hocko
2014-12-04 14:44                                                               ` Tejun Heo
2014-12-04 16:56                                                                 ` Michal Hocko
2014-12-04 17:18                                                                   ` Michal Hocko
2014-12-05 16:41                                                 ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
2014-12-05 16:41                                                   ` [PATCH -v2 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko
2014-12-06 12:56                                                     ` Tejun Heo
2014-12-07 10:13                                                       ` Michal Hocko
2015-01-07 17:57                                                     ` Tejun Heo
2015-01-07 18:23                                                       ` Michal Hocko
2014-12-05 16:41                                                   ` [PATCH -v2 2/5] OOM: thaw the OOM victim if it is frozen Michal Hocko
2014-12-06 13:06                                                     ` Tejun Heo
2014-12-07 10:24                                                       ` Michal Hocko
2014-12-07 10:45                                                         ` Michal Hocko
2014-12-07 13:59                                                           ` Tejun Heo
2014-12-07 18:55                                                             ` Michal Hocko
2014-12-05 16:41                                                   ` [PATCH -v2 3/5] PM: convert printk to pr_* equivalent Michal Hocko
2014-12-05 22:40                                                     ` Rafael J. Wysocki
2014-12-07 10:26                                                       ` Michal Hocko
2014-12-06 13:08                                                     ` Tejun Heo
2014-12-05 16:41                                                   ` [PATCH -v2 4/5] sysrq: " Michal Hocko
2014-12-06 13:09                                                     ` Tejun Heo
2014-12-05 16:41                                                   ` [PATCH -v2 5/5] OOM, PM: make OOM detection in the freezer path raceless Michal Hocko
2014-12-06 13:11                                                     ` Tejun Heo
2014-12-07 10:11                                                       ` Michal Hocko
2015-01-07 18:41                                                     ` Tejun Heo
2015-01-07 18:48                                                       ` Michal Hocko
2015-01-08 11:51                                                     ` Michal Hocko
2014-12-07 10:09                                                   ` [PATCH 0/4] OOM vs PM freezer fixes Michal Hocko
2014-12-07 13:55                                                     ` Tejun Heo
2014-12-07 19:00                                                       ` Michal Hocko
2014-12-18 16:27                                                         ` Michal Hocko
2014-11-05 14:55                   ` [PATCH 3/4] OOM, PM: OOM killed task shouldn't escape PM suspend Michal Hocko
2014-10-26 18:40   ` Pavel Machek
2014-10-21  7:27 ` [PATCH 4/4] PM: convert do_each_thread to for_each_process_thread Michal Hocko
2014-10-21 12:10   ` Rafael J. Wysocki
2014-10-21 13:19     ` Michal Hocko
2014-10-21 13:43       ` Rafael J. Wysocki
