* [PATCH -v3 0/5] OOM vs PM freezer fixes
@ 2015-01-09 11:05 ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-09 11:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

Hi,
this is an updated version of the patchset previously posted here:
http://marc.info/?l=linux-mm&m=141779771518056&w=2.
Changes since then are:
- cleanups, doc and function renames as per Tejun
- __thaw_task moved to mark_tsk_oom_victim and frozen() check removed
  as it would be racy and it is not necessary anyway - per Tejun
- obvious typo in wait_event condition
- oom_killer_enable moved to thaw_processes before user tasks are thawed
  rather than thaw_kernel_threads which is not even called from the s2ram
  resume path - per Tejun
- oom_killer_disable moved to freeze_processes to be more in sync with
  the enable.

I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
	echo processors > /sys/power/pm_test
	while true
	do
		echo mem > /sys/power/state
		sleep 1s
	done
- simple module which allocates and frees 20M in 8K chunks. If it sees
  freezing(current) then it tries another round of allocation before calling
  try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
  and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
  it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
  in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
  I think this should be OK because __thaw_task shouldn't interfere with any
  locking down wake_up_process. Oleg?
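
For illustration only, one of the small memory-hungry tasks from the list
above (20M anon mmap) could look roughly like the sketch below. It is not the
program used in the test; the function name eat_anon_memory and the choice to
unmap at the end are assumptions made so the sketch is safe to run anywhere.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for one "mem_eater" task: map `bytes` of anonymous
 * memory and write one byte per page so that every page is really faulted in.
 * Returns the number of pages touched, or 0 if the mapping failed.
 */
size_t eat_anon_memory(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	size_t touched = 0;
	size_t off;

	if (page <= 0 || bytes == 0)
		return 0;

	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 0;

	/* Touch one byte per page; untouched anonymous pages cost no RAM. */
	for (off = 0; off < bytes; off += (size_t)page) {
		buf[off] = 1;
		touched++;
	}

	/* Unmapped here so the sketch has no lasting effect on the machine. */
	munmap(buf, bytes);
	return touched;
}
```

Calling eat_anon_memory(20u << 20) touches 20M worth of pages. A task meant to
keep memory pressure up would presumably hold on to the mapping instead of
releasing it right away.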

As expected, no tasks are OOM killed after the OOM killer is disabled, and
allocations requested by the kernel thread fail once all the tasks are
frozen and the OOM killer is disabled. I wasn't able to catch a race where
oom_killer_disable would really have to wait, but I kinda expected that
the race would be really unlikely.

[  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[  243.636072] (elapsed 2.837 seconds) done.
[  243.641985] Trying to disable OOM killer
[  243.643032] Waiting for concurent OOM victims
[  243.644342] OOM killer disabled
[  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[  243.652983] Suspending console(s) (use no_console_suspend to debug)
[  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[  243.992600] PM: suspend of devices complete after 336.667 msecs
[  243.993264] PM: late suspend of devices complete after 0.660 msecs
[  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[  243.994717] ACPI: Preparing to enter system sleep state S3
[  243.994795] PM: Saving platform NVS memory
[  243.994796] Disabling non-boot CPUs ...

The first 2 patches are simple cleanups for OOM. They should go in
regardless of the rest IMO.
Patches 3 and 4 are trivial printk -> pr_* conversions and they should
go in as well.
The main patch is the last one and I would appreciate acks from Tejun
and Rafael. I think the OOM part should be OK (except for __thaw_task
vs. task_lock where a look from Oleg would be appreciated) but I am not
so sure I haven't screwed anything up in the freezer code. I have found
several surprises there.

The patchset is based on the current mmotm tree (mmotm-2015-01-07-17-07).
I think it makes more sense if it is routed via Andrew due to dependencies on
other OOM killer patches.

Shortlog says:
Michal Hocko (5):
      oom: add helpers for setting and clearing TIF_MEMDIE
      oom: thaw the OOM victim if it is frozen
      PM: convert printk to pr_* equivalent
      sysrq: convert printk to pr_* equivalent
      oom, PM: make OOM detection in the freezer path raceless

And diffstat:
 drivers/staging/android/lowmemorykiller.c |   7 +-
 drivers/tty/sysrq.c                       |  23 ++---
 include/linux/oom.h                       |  18 ++--
 kernel/exit.c                             |   3 +-
 kernel/power/process.c                    |  76 +++++----------
 mm/memcontrol.c                           |   4 +-
 mm/oom_kill.c                             | 149 ++++++++++++++++++++++++++----
 mm/page_alloc.c                           |  17 +---
 8 files changed, 185 insertions(+), 112 deletions(-)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH -v3 1/5] oom: add helpers for setting and clearing TIF_MEMDIE
  2015-01-09 11:05 ` Michal Hocko
@ 2015-01-09 11:05   ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-09 11:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

This patch is purely preparatory and doesn't introduce any functional
change.

Note:
I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent further kills. This is
just a side effect of the flag. Its primary purpose is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/staging/android/lowmemorykiller.c |  7 ++++++-
 include/linux/oom.h                       |  4 ++++
 kernel/exit.c                             |  2 +-
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 23 ++++++++++++++++++++---
 5 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index b545d3d1da3e..feafa172b155 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -160,7 +160,12 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 			     selected->pid, selected->comm,
 			     selected_oom_score_adj, selected_tasksize);
 		lowmem_deathpending_timeout = jiffies + HZ;
-		set_tsk_thread_flag(selected, TIF_MEMDIE);
+		/*
+		 * FIXME: lowmemorykiller shouldn't abuse global OOM killer
+		 * infrastructure. There is no real reason why the selected
+		 * task should have access to the memory reserves.
+		 */
+		mark_tsk_oom_victim(selected);
 		send_sig(SIGKILL, selected, 0);
 		rem += selected_tasksize;
 	}
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76200984d1e2..b42b80f88c3a 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -47,6 +47,10 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
 }
 
+extern void mark_tsk_oom_victim(struct task_struct *tsk);
+
+extern void unmark_oom_victim(void);
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/kernel/exit.c b/kernel/exit.c
index 287884b05b89..5db52e52c493 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -456,7 +456,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	clear_thread_flag(TIF_MEMDIE);
+	unmark_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0021313d1210..18ecef729597 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1559,7 +1559,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 294493a7ae4b..80b34e285f96 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,6 +416,23 @@ void note_oom_kill(void)
 	atomic_inc(&oom_kills);
 }
 
+/**
+ * mark_tsk_oom_victim - marks the given task as OOM victim.
+ * @tsk: task to mark
+ */
+void mark_tsk_oom_victim(struct task_struct *tsk)
+{
+	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+}
+
+/**
+ * unmark_oom_victim - unmarks the current task as OOM victim.
+ */
+void unmark_oom_victim(void)
+{
+	clear_thread_flag(TIF_MEMDIE);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -440,7 +457,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	task_lock(p);
 	if (p->mm && task_will_free_mem(p)) {
-		set_tsk_thread_flag(p, TIF_MEMDIE);
+		mark_tsk_oom_victim(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -495,7 +512,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
-	set_tsk_thread_flag(victim, TIF_MEMDIE);
+	mark_tsk_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -652,7 +669,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		set_thread_flag(TIF_MEMDIE);
+		mark_tsk_oom_victim(current);
 		return;
 	}
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH -v3 2/5] oom: thaw the OOM victim if it is frozen
  2015-01-09 11:05 ` Michal Hocko
  (?)
@ 2015-01-09 11:05   ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-09 11:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

oom_kill_process only sets the TIF_MEMDIE flag and sends a signal to the
victim. This is basically a noop when the task is frozen, though, because
the task sleeps in uninterruptible sleep.
The victim is eventually thawed when oom_scan_process_thread meets
the task again in a later OOM invocation, so the OOM killer doesn't
livelock. But this is less than optimal.

Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE
on the victim. We are not checking whether the task is frozen
because that would be racy and __thaw_task does that already.
oom_scan_process_thread doesn't need to care about the freezer anymore as
TIF_MEMDIE and the freezer are now completely mutually exclusive.
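
The reasoning above can be pictured with a small userspace model. This is
only a sketch under loud assumptions: struct toy_task and the toy_* functions
are invented for illustration and are not kernel code. The model takes no
locks, and the real __thaw_task only wakes the task up rather than clearing
its state directly.

```c
#include <stdbool.h>

/* Toy stand-in for struct task_struct; both fields are illustrative only. */
struct toy_task {
	bool memdie;	/* models TIF_MEMDIE */
	bool frozen;	/* models the task sleeping in the freezer */
};

/*
 * Models __thaw_task(): the frozen check lives inside the helper, so
 * callers may invoke it unconditionally.
 */
void toy_thaw_task(struct toy_task *t)
{
	if (t->frozen)
		t->frozen = false;	/* stands in for waking the task up */
}

/*
 * Models mark_tsk_oom_victim() after this patch: set the flag first,
 * then thaw without a separate frozen check in the caller.
 */
void toy_mark_oom_victim(struct toy_task *t)
{
	t->memdie = true;
	toy_thaw_task(t);
}
```

Marking a frozen toy task leaves it with memdie set and frozen cleared;
marking a task that was never frozen only sets memdie.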

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/oom_kill.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 80b34e285f96..3cbd76b8c13b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -266,8 +266,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (unlikely(frozen(task)))
-			__thaw_task(task);
 		if (!force_kill)
 			return OOM_SCAN_ABORT;
 	}
@@ -423,6 +421,14 @@ void note_oom_kill(void)
 void mark_tsk_oom_victim(struct task_struct *tsk)
 {
 	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+
+	/*
+	 * Make sure that the task is woken up from uninterruptible sleep
+	 * if it is frozen because OOM killer wouldn't be able to free
+	 * any memory and livelock. freezing_slow_path will tell the freezer
+	 * that TIF_MEMDIE tasks should be ignored.
+	 */
+	__thaw_task(tsk);
 }
 
 /**
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH -v3 3/5] PM: convert printk to pr_* equivalent
  2015-01-09 11:05 ` Michal Hocko
  (?)
@ 2015-01-09 11:05   ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-09 11:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

While touching this area let's convert printk to pr_*. This also makes
continuation lines print properly.
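
As a rough picture of the difference, here is a userspace model of a log
where one call starts a record with a level tag and another continues it.
All toy_* names and the "<6>" tag format are assumptions for illustration.
In the kernel, pr_info expands to printk with KERN_INFO and pr_cont to
printk with KERN_CONT.

```c
#include <stdio.h>
#include <string.h>

static char toy_log[256];	/* everything "printed" so far */

/* Append a fragment to the toy log, truncating silently once it is full. */
static void toy_append(const char *s)
{
	size_t used = strlen(toy_log);

	snprintf(toy_log + used, sizeof(toy_log) - used, "%s", s);
}

/* Models pr_info(): start a new record tagged with the info level (6). */
void toy_pr_info(const char *msg)
{
	toy_append("<6>");
	toy_append(msg);
}

/* Models pr_cont(): continue the current record without a new level tag. */
void toy_pr_cont(const char *msg)
{
	toy_append(msg);
}

/* Read-only view of the accumulated log text. */
const char *toy_log_contents(void)
{
	return toy_log;
}
```

Issuing toy_pr_info("Freezing user space processes ... ") followed by
toy_pr_cont calls for the elapsed time, "done." and the newline produces a
single tagged line, which is the shape freeze_processes aims for after the
conversion.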

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
---
 kernel/power/process.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..3ac45f192e9f 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -84,8 +84,8 @@ static int try_to_freeze_tasks(bool user_only)
 	elapsed_msecs = elapsed_msecs64;
 
 	if (todo) {
-		printk("\n");
-		printk(KERN_ERR "Freezing of tasks %s after %d.%03d seconds "
+		pr_cont("\n");
+		pr_err("Freezing of tasks %s after %d.%03d seconds "
 		       "(%d tasks refusing to freeze, wq_busy=%d):\n",
 		       wakeup ? "aborted" : "failed",
 		       elapsed_msecs / 1000, elapsed_msecs % 1000,
@@ -101,7 +101,7 @@ static int try_to_freeze_tasks(bool user_only)
 			read_unlock(&tasklist_lock);
 		}
 	} else {
-		printk("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
+		pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
 			elapsed_msecs % 1000);
 	}
 
@@ -155,7 +155,7 @@ int freeze_processes(void)
 		atomic_inc(&system_freezing_cnt);
 
 	pm_wakeup_clear();
-	printk("Freezing user space processes ... ");
+	pr_info("Freezing user space processes ... ");
 	pm_freezing = true;
 	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
@@ -171,13 +171,13 @@ int freeze_processes(void)
 		if (oom_kills_count() != oom_kills_saved &&
 		    !check_frozen_processes()) {
 			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
+			pr_cont("OOM in progress.");
 			error = -EBUSY;
 		} else {
-			printk("done.");
+			pr_cont("done.");
 		}
 	}
-	printk("\n");
+	pr_cont("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
@@ -197,13 +197,14 @@ int freeze_kernel_threads(void)
 {
 	int error;
 
-	printk("Freezing remaining freezable tasks ... ");
+	pr_info("Freezing remaining freezable tasks ... ");
+
 	pm_nosig_freezing = true;
 	error = try_to_freeze_tasks(false);
 	if (!error)
-		printk("done.");
+		pr_cont("done.");
 
-	printk("\n");
+	pr_cont("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
@@ -224,7 +225,7 @@ void thaw_processes(void)
 
 	oom_killer_enable();
 
-	printk("Restarting tasks ... ");
+	pr_info("Restarting tasks ... ");
 
 	__usermodehelper_set_disable_depth(UMH_FREEZING);
 	thaw_workqueues();
@@ -243,7 +244,7 @@ void thaw_processes(void)
 	usermodehelper_enable();
 
 	schedule();
-	printk("done.\n");
+	pr_cont("done.\n");
 	trace_suspend_resume(TPS("thaw_processes"), 0, false);
 }
 
@@ -252,7 +253,7 @@ void thaw_kernel_threads(void)
 	struct task_struct *g, *p;
 
 	pm_nosig_freezing = false;
-	printk("Restarting kernel threads ... ");
+	pr_info("Restarting kernel threads ... ");
 
 	thaw_workqueues();
 
@@ -264,5 +265,5 @@ void thaw_kernel_threads(void)
 	read_unlock(&tasklist_lock);
 
 	schedule();
-	printk("done.\n");
+	pr_cont("done.\n");
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH -v3 4/5] sysrq: convert printk to pr_* equivalent
  2015-01-09 11:05 ` Michal Hocko
  (?)
@ 2015-01-09 11:05   ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-09 11:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

While touching this area, convert printk to the pr_* helpers. This also
ensures that continuation lines are printed properly.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
---
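Editorial note: a minimal userspace sketch of the pr_info()/pr_cont()
split the conversion relies on. The level markers, model_printk() and the
log_text/log_records variables are hypothetical stand-ins, not the
kernel's implementation; the only thing taken from the kernel is that
pr_info() prepends KERN_INFO and pr_cont() prepends KERN_CONT to the
format string.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Simplified, hypothetical level markers; the kernel's real encoding differs. */
#define KERN_INFO "<6>"
#define KERN_CONT "<c>"

/* Both macros rely on string literal concatenation with the format string. */
#define pr_info(...) model_printk(KERN_INFO __VA_ARGS__)
#define pr_cont(...) model_printk(KERN_CONT __VA_ARGS__)

static char log_text[256];	/* text of everything logged so far */
static int log_records;		/* number of records that were started */

static void model_printk(const char *fmt, ...)
{
	char msg[128];
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(msg, sizeof(msg), fmt, ap);
	va_end(ap);

	/* Every caller goes through pr_info()/pr_cont(), so a marker is present. */
	assert(msg[0] == '<');

	/* KERN_CONT appends to the current record, any other level starts a new one. */
	if (strncmp(msg, KERN_CONT, sizeof(KERN_CONT) - 1) != 0)
		log_records++;

	/* Drop the three-character marker and store the text. */
	strncat(log_text, msg + 3, sizeof(log_text) - strlen(log_text) - 1);
}
```

In this model the sequence pr_info("SysRq : "), pr_cont("HELP : "),
pr_cont("\n") produces a single record, which is the behaviour the
__handle_sysrq() hunk below is after.
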
 drivers/tty/sysrq.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..0071469ecbf1 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -90,7 +90,7 @@ static void sysrq_handle_loglevel(int key)
 
 	i = key - '0';
 	console_loglevel = CONSOLE_LOGLEVEL_DEFAULT;
-	printk("Loglevel set to %d\n", i);
+	pr_info("Loglevel set to %d\n", i);
 	console_loglevel = i;
 }
 static struct sysrq_key_op sysrq_loglevel_op = {
@@ -220,7 +220,7 @@ static void showacpu(void *dummy)
 		return;
 
 	spin_lock_irqsave(&show_lock, flags);
-	printk(KERN_INFO "CPU%d:\n", smp_processor_id());
+	pr_info("CPU%d:\n", smp_processor_id());
 	show_stack(NULL, NULL);
 	spin_unlock_irqrestore(&show_lock, flags);
 }
@@ -243,7 +243,7 @@ static void sysrq_handle_showallcpus(int key)
 		struct pt_regs *regs = get_irq_regs();
 
 		if (regs) {
-			printk(KERN_INFO "CPU%d:\n", smp_processor_id());
+			pr_info("CPU%d:\n", smp_processor_id());
 			show_regs(regs);
 		}
 		schedule_work(&sysrq_showallcpus);
@@ -522,7 +522,7 @@ void __handle_sysrq(int key, bool check_mask)
 	 */
 	orig_log_level = console_loglevel;
 	console_loglevel = CONSOLE_LOGLEVEL_DEFAULT;
-	printk(KERN_INFO "SysRq : ");
+	pr_info("SysRq : ");
 
         op_p = __sysrq_get_key_op(key);
         if (op_p) {
@@ -531,14 +531,14 @@ void __handle_sysrq(int key, bool check_mask)
 		 * should not) and is the invoked operation enabled?
 		 */
 		if (!check_mask || sysrq_on_mask(op_p->enable_mask)) {
-			printk("%s\n", op_p->action_msg);
+			pr_cont("%s\n", op_p->action_msg);
 			console_loglevel = orig_log_level;
 			op_p->handler(key);
 		} else {
-			printk("This sysrq operation is disabled.\n");
+			pr_cont("This sysrq operation is disabled.\n");
 		}
 	} else {
-		printk("HELP : ");
+		pr_cont("HELP : ");
 		/* Only print the help msg once per handler */
 		for (i = 0; i < ARRAY_SIZE(sysrq_key_table); i++) {
 			if (sysrq_key_table[i]) {
@@ -549,10 +549,10 @@ void __handle_sysrq(int key, bool check_mask)
 					;
 				if (j != i)
 					continue;
-				printk("%s ", sysrq_key_table[i]->help_msg);
+				pr_cont("%s ", sysrq_key_table[i]->help_msg);
 			}
 		}
-		printk("\n");
+		pr_cont("\n");
 		console_loglevel = orig_log_level;
 	}
 	rcu_read_unlock();
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless
  2015-01-09 11:05 ` Michal Hocko
  (?)
@ 2015-01-09 11:05   ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-09 11:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window in which the OOM killer manages to call
note_oom_kill after freeze_processes has checked the counter. The race
window is quite small and really unlikely to be hit, and a partial
solution was deemed sufficient at the time of submission.

Tejun wasn't happy about this partial solution, though, and insisted on
a full solution. That requires full exclusion between the OOM killer and
the freezer's task freezing. This patch implements it by introducing the
oom_sem RW lock and turning oom_killer_disable() into a full OOM barrier.

oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.

oom_killer_disable() takes oom_sem for writing, so it waits for all
currently running OOM killer invocations. Then it disables all further
OOMs by setting oom_killer_disabled and checks for any OOM victims.
Victims are counted via mark_tsk_oom_victim and unmark_oom_victim,
respectively. The last victim wakes up all waiters enqueued by
oom_killer_disable(). Therefore this function acts as a full OOM
barrier.

The page fault path is covered now as well, although it was assumed to
be safe before. As per Tejun, "We used to have freezing points deep in
file system code which may be reachable from page fault.", so it is
better and more robust not to rely on freezing points here. The same
applies to the memcg OOM killer.

out_of_memory tells the caller whether the OOM was allowed to trigger
and the callers are supposed to handle the situation. The page
allocation path simply fails the allocation same as before. The page
fault path will retry the fault (more on that later) and Sysrq OOM
trigger will simply complain to the log.

Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because
	- the victim cannot loop in the page fault handler (it would die
	  on the way out from the exception)
	- it cannot loop in the page allocator because all further
	  allocations would fail and __GFP_NOFAIL allocations are not
	  acceptable at this stage
	- it shouldn't be blocked on any locks held by frozen tasks
	  (try_to_freeze expects lockless context) and kernel threads and
	  work queues are not frozen yet

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c    |   5 +-
 include/linux/oom.h    |  14 ++----
 kernel/exit.c          |   3 +-
 kernel/power/process.c |  50 ++++---------------
 mm/memcontrol.c        |   2 +-
 mm/oom_kill.c          | 132 +++++++++++++++++++++++++++++++++++++++++--------
 mm/page_alloc.c        |  17 +------
 7 files changed, 132 insertions(+), 91 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 0071469ecbf1..259a4d5a4e8f 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,9 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true))
+		pr_info("OOM request ignored because killer is disabled\n");
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b42b80f88c3a..d5771bed59c9 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,22 +72,14 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
 extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
-
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+extern bool oom_killer_disable(void);
+extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 5db52e52c493..4e319a0c97ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -456,7 +456,8 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	unmark_oom_victim();
+	if (test_thread_flag(TIF_MEMDIE))
+		unmark_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 3ac45f192e9f..564f786df470 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,29 +132,22 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	pr_info("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			pr_cont("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			pr_cont("done.");
-		}
+		pr_cont("done.");
 	}
 	pr_cont("\n");
 	BUG_ON(in_atomic());
 
+	/*
+	 * Now that the whole userspace is frozen we need to disable
+	 * the OOM killer to disallow any further interference with
+	 * killable tasks.
+	 */
+	if (!error && !oom_killer_disable())
+		error = -EBUSY;
+
 	if (error)
 		thaw_processes();
 	return error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18ecef729597..c1e408bdc713 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1933,7 +1933,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
-	if (!handle)
+	if (!handle || oom_killer_disabled)
 		goto cleanup;
 
 	owait.memcg = memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3cbd76b8c13b..b8df76ee2be3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -398,30 +398,27 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 /*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
+ * Number of OOM victims in flight
  */
-static atomic_t oom_kills = ATOMIC_INIT(0);
+static atomic_t oom_victims = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
+bool oom_killer_disabled __read_mostly;
+static DECLARE_RWSEM(oom_sem);
 
 /**
  * mark_tsk_oom_victim - marks the given taks as OOM victim.
  * @tsk: task to mark
+ *
+ * Has to be called with oom_sem taken for read and never after
+ * oom has been disabled already.
  */
 void mark_tsk_oom_victim(struct task_struct *tsk)
 {
-	set_tsk_thread_flag(tsk, TIF_MEMDIE);
-
+	WARN_ON(oom_killer_disabled);
+	/* OOM killer might race with memcg OOM */
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -429,14 +426,70 @@ void mark_tsk_oom_victim(struct task_struct *tsk)
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
 	__thaw_task(tsk);
+	atomic_inc(&oom_victims);
 }
 
 /**
  * unmark_oom_victim - unmarks the current task as OOM victim.
+ *
+ * Wakes up all waiters in oom_killer_disable()
  */
 void unmark_oom_victim(void)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	if (!test_and_clear_thread_flag(TIF_MEMDIE))
+		return;
+
+	down_read(&oom_sem);
+	/*
+	 * There is no need to signal the last oom_victim if there
+	 * is nobody who cares.
+	 */
+	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+		wake_up_all(&oom_victims_wait);
+	up_read(&oom_sem);
+}
+
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are killed.
+ *
+ * The function cannot be called when there are runnable user tasks because
+ * the userspace would see unexpected allocation failures as a result. Any
+ * new usage of this function should be discussed with MM people.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
+bool oom_killer_disable(void)
+{
+	/*
+	 * Make sure to not race with an ongoing OOM killer
+	 * and that the current is not the victim.
+	 */
+	down_write(&oom_sem);
+	if (test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		return false;
+	}
+
+	oom_killer_disabled = true;
+	up_write(&oom_sem);
+
+	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
+
+	return true;
+}
+
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+void oom_killer_enable(void)
+{
+	down_write(&oom_sem);
+	oom_killer_disabled = false;
+	up_write(&oom_sem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -637,7 +690,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -649,7 +702,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -718,6 +771,32 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/**
+ * out_of_memory -  tries to invoke OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * Invokes __out_of_memory() and returns true unless the OOM killer has
+ * been disabled by oom_killer_disable(), in which case it returns false.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	bool ret = false;
+
+	down_read(&oom_sem);
+	if (!oom_killer_disabled) {
+		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+		ret = true;
+	}
+	up_read(&oom_sem);
+
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -727,12 +806,25 @@ void pagefault_out_of_memory(void)
 {
 	struct zonelist *zonelist;
 
+	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		return;
+		goto unlock;
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		if (!oom_killer_disabled)
+			__out_of_memory(NULL, 0, 0, NULL, false);
+		else
+			/*
+			 * There shouldn't be any user tasks runnable while the
+			 * OOM killer is disabled so the current task has to
+			 * be a racing OOM victim which oom_killer_disable()
+			 * is waiting for.
+			 */
+			WARN_ON(test_thread_flag(TIF_MEMDIE));
+
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
+unlock:
+	up_read(&oom_sem);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ed7f93d0152..b89fc9e84d48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -244,8 +244,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2305,9 +2303,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 
 	*did_some_progress = 0;
 
-	if (oom_killer_disabled)
-		return NULL;
-
 	/*
 	 * Acquire the per-zone oom lock for each zone.  If that
 	 * fails, somebody else is making progress for us.
@@ -2319,14 +2314,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2362,8 +2349,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-	*did_some_progress = 1;
+	if (out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*did_some_progress = 1;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless
@ 2015-01-09 11:05   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-09 11:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window in which the OOM killer manages to call
note_oom_kill after freeze_processes has checked the counter. The race
window is quite small and really unlikely to be hit, and a partial
solution was deemed sufficient at the time of submission.

Tejun wasn't happy about this partial solution, though, and insisted on
a full solution. That requires full exclusion between the OOM killer and
the freezer's task freezing. This patch implements it by introducing the
oom_sem RW lock and turning oom_killer_disable() into a full OOM barrier.

oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.

oom_killer_disable() takes oom_sem for writing, so it waits for all
currently running OOM killer invocations. Then it disables all further
OOMs by setting oom_killer_disabled and checks for any OOM victims.
Victims are counted via mark_tsk_oom_victim and unmark_oom_victim,
respectively. The last victim wakes up all waiters enqueued by
oom_killer_disable(). Therefore this function acts as a full OOM
barrier.

The page fault path is covered now as well, although it was assumed to
be safe before. As per Tejun, "We used to have freezing points deep in
file system code which may be reachable from page fault.", so it is
better and more robust not to rely on freezing points here. The same
applies to the memcg OOM killer.

out_of_memory tells the caller whether the OOM was allowed to trigger
and the callers are supposed to handle the situation. The page
allocation path simply fails the allocation same as before. The page
fault path will retry the fault (more on that later) and Sysrq OOM
trigger will simply complain to the log.

Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because
	- the victim cannot loop in the page fault handler (it would die
	  on the way out from the exception)
	- it cannot loop in the page allocator because all further
	  allocations would fail and __GFP_NOFAIL allocations are not
	  acceptable at this stage
	- it shouldn't be blocked on any locks held by frozen tasks
	  (try_to_freeze expects lockless context) and kernel threads and
	  work queues are not frozen yet

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c    |   5 +-
 include/linux/oom.h    |  14 ++----
 kernel/exit.c          |   3 +-
 kernel/power/process.c |  50 ++++---------------
 mm/memcontrol.c        |   2 +-
 mm/oom_kill.c          | 132 +++++++++++++++++++++++++++++++++++++++++--------
 mm/page_alloc.c        |  17 +------
 7 files changed, 132 insertions(+), 91 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 0071469ecbf1..259a4d5a4e8f 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,9 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true))
+		pr_info("OOM request ignored because killer is disabled\n");
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b42b80f88c3a..d5771bed59c9 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,22 +72,14 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
 extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
-
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+extern bool oom_killer_disable(void);
+extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 5db52e52c493..4e319a0c97ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -456,7 +456,8 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	unmark_oom_victim();
+	if (test_thread_flag(TIF_MEMDIE))
+		unmark_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 3ac45f192e9f..564f786df470 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,29 +132,22 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	pr_info("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			pr_cont("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			pr_cont("done.");
-		}
+		pr_cont("done.");
 	}
 	pr_cont("\n");
 	BUG_ON(in_atomic());
 
+	/*
+	 * Now that the whole userspace is frozen we need to disbale
+	 * the OOM killer to disallow any further interference with
+	 * killable tasks.
+	 */
+	if (!error && !oom_killer_disable())
+		error = -EBUSY;
+
 	if (error)
 		thaw_processes();
 	return error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18ecef729597..c1e408bdc713 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1933,7 +1933,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
-	if (!handle)
+	if (!handle || oom_killer_disabled)
 		goto cleanup;
 
 	owait.memcg = memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3cbd76b8c13b..b8df76ee2be3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -398,30 +398,27 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 /*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
+ * Number of OOM victims in flight
  */
-static atomic_t oom_kills = ATOMIC_INIT(0);
+static atomic_t oom_victims = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
+bool oom_killer_disabled __read_mostly;
+static DECLARE_RWSEM(oom_sem);
 
 /**
  * mark_tsk_oom_victim - marks the given taks as OOM victim.
  * @tsk: task to mark
+ *
+ * Has to be called with oom_sem taken for read and never after
+ * oom has been disabled already.
  */
 void mark_tsk_oom_victim(struct task_struct *tsk)
 {
-	set_tsk_thread_flag(tsk, TIF_MEMDIE);
-
+	WARN_ON(oom_killer_disabled);
+	/* OOM killer might race with memcg OOM */
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -429,14 +426,70 @@ void mark_tsk_oom_victim(struct task_struct *tsk)
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
 	__thaw_task(tsk);
+	atomic_inc(&oom_victims);
 }
 
 /**
  * unmark_oom_victim - unmarks the current task as OOM victim.
+ *
+ * Wakes up all waiters in oom_killer_disable()
  */
 void unmark_oom_victim(void)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	if (!test_and_clear_thread_flag(TIF_MEMDIE))
+		return;
+
+	down_read(&oom_sem);
+	/*
+	 * There is no need to signal the lasst oom_victim if there
+	 * is nobody who cares.
+	 */
+	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+		wake_up_all(&oom_victims_wait);
+	up_read(&oom_sem);
+}
+
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are killed.
+ *
+ * The function cannot be called when there are runnable user tasks because
+ * the userspace would see unexpected allocation failures as a result. Any
+ * new usage of this function should be consulted with MM people.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
+bool oom_killer_disable(void)
+{
+	/*
+	 * Make sure to not race with an ongoing OOM killer
+	 * and that the current is not the victim.
+	 */
+	down_write(&oom_sem);
+	if (test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		return false;
+	}
+
+	oom_killer_disabled = true;
+	up_write(&oom_sem);
+
+	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
+
+	return true;
+}
+
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+void oom_killer_enable(void)
+{
+	down_write(&oom_sem);
+	oom_killer_disabled = false;
+	up_write(&oom_sem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -637,7 +690,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -649,7 +702,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -718,6 +771,32 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/**
+ * out_of_memory -  tries to invoke OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * Invokes __out_of_memory() and returns true if the OOM killer has not
+ * been disabled by oom_killer_disable(). Otherwise returns false.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	bool ret = false;
+
+	down_read(&oom_sem);
+	if (!oom_killer_disabled) {
+		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+		ret = true;
+	}
+	up_read(&oom_sem);
+
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -727,12 +806,25 @@ void pagefault_out_of_memory(void)
 {
 	struct zonelist *zonelist;
 
+	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		return;
+		goto unlock;
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		if (!oom_killer_disabled)
+			__out_of_memory(NULL, 0, 0, NULL, false);
+		else
+			/*
+			 * There shouldn't be any user tasks runable while the
+			 * OOM killer is disabled so the current task has to
+			 * be a racing OOM victim for which oom_killer_disable()
+			 * is waiting for.
+			 */
+			WARN_ON(test_thread_flag(TIF_MEMDIE));
+
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
+unlock:
+	up_read(&oom_sem);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ed7f93d0152..b89fc9e84d48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -244,8 +244,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2305,9 +2303,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 
 	*did_some_progress = 0;
 
-	if (oom_killer_disabled)
-		return NULL;
-
 	/*
 	 * Acquire the per-zone oom lock for each zone.  If that
 	 * fails, somebody else is making progress for us.
@@ -2319,14 +2314,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2362,8 +2349,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-	*did_some_progress = 1;
+	if (out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*did_some_progress = 1;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless
  2015-01-09 11:05   ` Michal Hocko
@ 2015-01-10  0:54     ` Cong Wang
  -1 siblings, 0 replies; 31+ messages in thread
From: Cong Wang @ 2015-01-10  0:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, linux-mm, LKML,
	Linux PM

On Fri, Jan 9, 2015 at 3:05 AM, Michal Hocko <mhocko@suse.cz> wrote:
>  /**
>   * freeze_processes - Signal user space processes to enter the refrigerator.
>   * The current thread will not be frozen.  The same process that calls
> @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
>  int freeze_processes(void)
>  {
>         int error;
> -       int oom_kills_saved;
>
>         error = __usermodehelper_disable(UMH_FREEZING);
>         if (error)
> @@ -157,29 +132,22 @@ int freeze_processes(void)
>         pm_wakeup_clear();
>         pr_info("Freezing user space processes ... ");
>         pm_freezing = true;
> -       oom_kills_saved = oom_kills_count();
>         error = try_to_freeze_tasks(true);
>         if (!error) {
>                 __usermodehelper_set_disable_depth(UMH_DISABLED);
> -               oom_killer_disable();
> -
> -               /*
> -                * There might have been an OOM kill while we were
> -                * freezing tasks and the killed task might be still
> -                * on the way out so we have to double check for race.
> -                */
> -               if (oom_kills_count() != oom_kills_saved &&
> -                   !check_frozen_processes()) {
> -                       __usermodehelper_set_disable_depth(UMH_ENABLED);
> -                       pr_cont("OOM in progress.");
> -                       error = -EBUSY;
> -               } else {
> -                       pr_cont("done.");
> -               }
> +               pr_cont("done.");
>         }
>         pr_cont("\n");
>         BUG_ON(in_atomic());
>
> +       /*
> +        * Now that the whole userspace is frozen we need to disbale


disable


> +        * the OOM killer to disallow any further interference with
> +        * killable tasks.
> +        */
> +       if (!error && !oom_killer_disable())
> +               error = -EBUSY;
> +
[...]
>  void unmark_oom_victim(void)
>  {
> -       clear_thread_flag(TIF_MEMDIE);
> +       if (!test_and_clear_thread_flag(TIF_MEMDIE))
> +               return;
> +
> +       down_read(&oom_sem);
> +       /*
> +        * There is no need to signal the lasst oom_victim if there

last

> +        * is nobody who cares.
> +        */
> +       if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +               wake_up_all(&oom_victims_wait);
> +       up_read(&oom_sem);
> +}
[...]
>  /*
>   * The pagefault handler calls here because it is out of memory, so kill a
>   * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
> @@ -727,12 +806,25 @@ void pagefault_out_of_memory(void)
>  {
>         struct zonelist *zonelist;
>
> +       down_read(&oom_sem);
>         if (mem_cgroup_oom_synchronize(true))
> -               return;
> +               goto unlock;
>
>         zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
>         if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> -               out_of_memory(NULL, 0, 0, NULL, false);
> +               if (!oom_killer_disabled)
> +                       __out_of_memory(NULL, 0, 0, NULL, false);
> +               else
> +                       /*
> +                        * There shouldn't be any user tasks runable while the

runnable


> +                        * OOM killer is disabled so the current task has to
> +                        * be a racing OOM victim for which oom_killer_disable()
> +                        * is waiting for.
> +                        */
> +                       WARN_ON(test_thread_flag(TIF_MEMDIE));
> +
>                 oom_zonelist_unlock(zonelist, GFP_KERNEL);
>         }
> +unlock:
> +       up_read(&oom_sem);
>  }


Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless
  2015-01-09 11:05   ` Michal Hocko
@ 2015-01-10 19:43     ` Tejun Heo
  -1 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2015-01-10 19:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

On Fri, Jan 09, 2015 at 12:05:55PM +0100, Michal Hocko wrote:
...
> @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
>  int freeze_processes(void)
>  {
>  	int error;
> -	int oom_kills_saved;
>  
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -157,29 +132,22 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	pr_info("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> -		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			pr_cont("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			pr_cont("done.");
> -		}
> +		pr_cont("done.");
>  	}
>  	pr_cont("\n");
>  	BUG_ON(in_atomic());
>  
> +	/*
> +	 * Now that the whole userspace is frozen we need to disbale
> +	 * the OOM killer to disallow any further interference with
> +	 * killable tasks.
> +	 */
> +	if (!error && !oom_killer_disable())

So, previously, oom killer was disabled at the top of
freeze_kernel_threads(), right?  I think that was the better spot to
do that.  We don't want to disable oom killer before the system is
just about to enter total quiescence which is freeze_kernel_threads().
We want to delay this as long as possible.  Let's please disable oom
killing at the top of freeze_kernel_threads() and re-enable at the
bottom of thaw_kernel_threads().

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless
  2015-01-10 19:43     ` Tejun Heo
@ 2015-01-12 16:10       ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-12 16:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

On Sat 10-01-15 14:43:22, Tejun Heo wrote:
> On Fri, Jan 09, 2015 at 12:05:55PM +0100, Michal Hocko wrote:
> ...
> > @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
> >  int freeze_processes(void)
> >  {
> >  	int error;
> > -	int oom_kills_saved;
> >  
> >  	error = __usermodehelper_disable(UMH_FREEZING);
> >  	if (error)
> > @@ -157,29 +132,22 @@ int freeze_processes(void)
> >  	pm_wakeup_clear();
> >  	pr_info("Freezing user space processes ... ");
> >  	pm_freezing = true;
> > -	oom_kills_saved = oom_kills_count();
> >  	error = try_to_freeze_tasks(true);
> >  	if (!error) {
> >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> > -		oom_killer_disable();
> > -
> > -		/*
> > -		 * There might have been an OOM kill while we were
> > -		 * freezing tasks and the killed task might be still
> > -		 * on the way out so we have to double check for race.
> > -		 */
> > -		if (oom_kills_count() != oom_kills_saved &&
> > -		    !check_frozen_processes()) {
> > -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> > -			pr_cont("OOM in progress.");
> > -			error = -EBUSY;
> > -		} else {
> > -			pr_cont("done.");
> > -		}
> > +		pr_cont("done.");
> >  	}
> >  	pr_cont("\n");
> >  	BUG_ON(in_atomic());
> >  
> > +	/*
> > +	 * Now that the whole userspace is frozen we need to disbale
> > +	 * the OOM killer to disallow any further interference with
> > +	 * killable tasks.
> > +	 */
> > +	if (!error && !oom_killer_disable())
> 
> So, previously, oom killer was disabled at the top of
> freeze_kernel_threads(), right?  I think that was the better spot to
> do that.  We don't want to disable oom killer before the system is
> just about to enter total quiescence which is freeze_kernel_threads().
> We want to delay this as long as possible.  Let's please disable oom
> killing at the top of freeze_kernel_threads() and re-enable at the
> bottom of thaw_kernel_threads().

Yes, I had it this way, but it didn't work out: thaw_kernel_threads
is not called on resume because it is only used as a failure
path when freezing kernel threads fails. I would rather keep the
enabling/disabling points as we had them. This is less risky IMHO.
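
The asymmetry can be illustrated with a toy user-space model. This is only
a sketch under stated assumptions: the enum, oom_enabled_after_cycle() and
the simplified call order are hypothetical, and the function bodies are not
the kernel's implementations. It assumes, as described above, that
thaw_kernel_threads() runs only when freezing kernel threads fails, while
thaw_processes() runs on every resume.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of this is the kernel's code. */
enum enable_site { IN_THAW_PROCESSES, IN_THAW_KERNEL_THREADS };

static bool oom_killer_disabled;

/* Models the failure path taken when freezing kernel threads fails. */
static void thaw_kernel_threads(enum enable_site site)
{
	if (site == IN_THAW_KERNEL_THREADS)
		oom_killer_disabled = false;
}

/* Models the resume path, which is reached on every cycle. */
static void thaw_processes(enum enable_site site)
{
	if (site == IN_THAW_PROCESSES)
		oom_killer_disabled = false;
}

/*
 * Runs one modelled suspend/resume cycle and reports whether the
 * OOM killer ends up enabled again.
 */
bool oom_enabled_after_cycle(enum enable_site site, bool kthread_freeze_fails)
{
	oom_killer_disabled = true;		/* user space frozen, OOM killer off */
	if (kthread_freeze_fails)
		thaw_kernel_threads(site);	/* only reached on failure */
	/* on success the devices would be suspended and resumed here */
	thaw_processes(site);			/* reached on both paths */
	return !oom_killer_disabled;
}
```

In this model, re-enabling from thaw_kernel_threads() leaves the OOM killer
disabled after a successful suspend/resume cycle, whereas re-enabling from
thaw_processes() covers both the success and the failure path.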

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless
  2015-01-12 16:10       ` Michal Hocko
@ 2015-01-12 17:22         ` Tejun Heo
  -1 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2015-01-12 17:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

On Mon, Jan 12, 2015 at 05:10:11PM +0100, Michal Hocko wrote:
> Yes I had it this way but it didn't work out because thaw_kernel_threads
> is not called on the resume because it is only used as a fail
> path when kernel threads freezing fails. I would rather keep the

Ooh, that's kinda asymmetric.

> enabling/disabling points as we had them. This is less risky IMHO.

Okay, please feel free to add

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless
  2015-01-12 17:22         ` Tejun Heo
@ 2015-01-12 17:35           ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-12 17:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

On Mon 12-01-15 12:22:51, Tejun Heo wrote:
> On Mon, Jan 12, 2015 at 05:10:11PM +0100, Michal Hocko wrote:
> > Yes I had it this way but it didn't work out because thaw_kernel_threads
> > is not called on the resume because it is only used as a fail
> > path when kernel threads freezing fails. I would rather keep the
> 
> Ooh, that's kinda asymmetric.
> 
> > enabling/disabling points as we had them. This is less risky IMHO.
> 
> Okay, please feel free to add
> 
>  Acked-by: Tejun Heo <tj@kernel.org>

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v3 0/5] OOM vs PM freezer fixes
  2015-01-09 11:05 ` Michal Hocko
  (?)
@ 2015-01-12 23:59   ` Andrew Morton
  -1 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2015-01-12 23:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

On Fri,  9 Jan 2015 12:05:50 +0100 Michal Hocko <mhocko@suse.cz> wrote:

> Hi,

I've been cheerily ignoring this discussion, sorry.  I trust everyone's
all happy and ready to go with this?

> [what changed since the last patchset]
>
> ...
>
> [testing results]
>
> ...
>
> [overview of the 5 patches]
>
> ...
> 

That's nice, but it doesn't really tell us what the patchset does.  The
first paragraph of the [5/5] changelog provides hints, but doesn't
explain why we even need to fix a race which is "quite small and really
unlikely".

So...  could we please have a few words describing the overall intent
and effect of this patchset?

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v3 0/5] OOM vs PM freezer fixes
  2015-01-12 23:59   ` Andrew Morton
@ 2015-01-13  8:41     ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2015-01-13  8:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, "Rafael J. Wysocki",
	David Rientjes, Johannes Weiner, Oleg Nesterov, Cong Wang,
	linux-mm, LKML, linux-pm

On Mon 12-01-15 15:59:35, Andrew Morton wrote:
> On Fri,  9 Jan 2015 12:05:50 +0100 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Hi,
> 
> I've been cheerily ignoring this discussion, sorry.  I trust everyone's
> all happy and ready to go with this?
> 
> > [what changed since the last patchset]
> >
> > ...
> >
> > [testing results]
> >
> > ...
> >
> > [overview of the 5 patches]
> >
> > ...
> > 
> 
> That's nice, but it doesn't really tell us what the patchset does.  The
> first paragraph of the [5/5] changelog provides hints, but doesn't
> explain why we even need to fix a race which is "quite small and really
> unlikely".

The primary reason for ruling out OOM killer from PM freezing is
described in the changelog of the original "fix" 5695be142e20 (OOM,
PM: OOM killed task shouldn't escape PM suspend) for which this is a
follow up:
"
    PM freezer relies on having all tasks frozen by the time devices are
    getting frozen so that no task will touch them while they are getting
    frozen. But OOM killer is allowed to kill an already frozen task in
    order to handle OOM situation. In order to protect from late wake ups
    OOM killer is disabled after all tasks are frozen. This, however, still
    keeps a window open when a killed task didn't manage to die by the time
    freeze_processes finishes.
"

The original patch didn't close the race window completely because
that would require a more complex solution, as can be seen from this
patchset.
 
> So...  could we please have a few words describing the overall intent
> and effect of this patchset?

The primary motivation was to close the race condition between the OOM
killer and the PM freezer _completely_. As Tejun pointed out, even though
the race condition is unlikely, it would be all the harder to debug weird
bugs deep in the PM freezer, where the debugging options are reduced
considerably.  I can only speculate about what might happen when a task is
still unexpectedly runnable. I can imagine deadlocks or memory
corruption, but I am by no means an expert in this area.

On the plus side, and as a side effect, the oom enable/disable now has
better (full barrier) semantics without polluting hot paths.
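
The "disable, then wait for in-flight victims" idea can be sketched in
user space. This is a minimal model under stated assumptions, not the
patch itself: a pthread mutex and condition variable stand in for the
oom_sem, oom_victims counter and oom_victims_wait queue seen in the
quoted hunks, and try_oom_kill() and victim_exit() are hypothetical
helpers invented for the illustration.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's implementation. */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t oom_victims_wait = PTHREAD_COND_INITIALIZER;
static int oom_victims;
static bool oom_killer_disabled;

/* OOM path: refuse to pick a victim once the killer is disabled. */
bool try_oom_kill(void)
{
	bool marked = false;

	pthread_mutex_lock(&oom_lock);
	if (!oom_killer_disabled) {
		oom_victims++;
		marked = true;
	}
	pthread_mutex_unlock(&oom_lock);
	return marked;
}

/* Victim exit path: the last victim wakes a waiting disabler, if any. */
void unmark_oom_victim(void)
{
	pthread_mutex_lock(&oom_lock);
	if (--oom_victims == 0 && oom_killer_disabled)
		pthread_cond_broadcast(&oom_victims_wait);
	pthread_mutex_unlock(&oom_lock);
}

/* Freezer path: forbid new kills, then wait for in-flight victims. */
void oom_killer_disable(void)
{
	pthread_mutex_lock(&oom_lock);
	oom_killer_disabled = true;
	while (oom_victims > 0)
		pthread_cond_wait(&oom_victims_wait, &oom_lock);
	pthread_mutex_unlock(&oom_lock);
}

/* Resume path: allow OOM kills again. */
void oom_killer_enable(void)
{
	pthread_mutex_lock(&oom_lock);
	oom_killer_disabled = false;
	pthread_mutex_unlock(&oom_lock);
}

/* Thread body for a victim that finishes dying. */
void *victim_exit(void *unused)
{
	(void)unused;
	unmark_oom_victim();
	return NULL;
}
```

Once oom_killer_disable() returns in this model, no victim is still on
its way out and no new one can be selected until oom_killer_enable() is
called, which is the property the freezer relies on.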

Hope that clarifies things a bit.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2015-01-13  8:46 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-09 11:05 [PATCH -v3 0/5] OOM vs PM freezer fixes Michal Hocko
2015-01-09 11:05 ` Michal Hocko
2015-01-09 11:05 ` [PATCH -v3 1/5] oom: add helpers for setting and clearing TIF_MEMDIE Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05 ` [PATCH -v3 2/5] oom: thaw the OOM victim if it is frozen Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05 ` [PATCH -v3 3/5] PM: convert printk to pr_* equivalent Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05 ` [PATCH -v3 4/5] sysrq: " Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05 ` [PATCH -v3 5/5] oom, PM: make OOM detection in the freezer path raceless Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-09 11:05   ` Michal Hocko
2015-01-10  0:54   ` Cong Wang
2015-01-10  0:54     ` Cong Wang
2015-01-10 19:43   ` Tejun Heo
2015-01-10 19:43     ` Tejun Heo
2015-01-12 16:10     ` Michal Hocko
2015-01-12 16:10       ` Michal Hocko
2015-01-12 17:22       ` Tejun Heo
2015-01-12 17:22         ` Tejun Heo
2015-01-12 17:35         ` Michal Hocko
2015-01-12 17:35           ` Michal Hocko
2015-01-12 23:59 ` [PATCH -v3 0/5] OOM vs PM freezer fixes Andrew Morton
2015-01-12 23:59   ` Andrew Morton
2015-01-12 23:59   ` Andrew Morton
2015-01-13  8:41   ` Michal Hocko
2015-01-13  8:41     ` Michal Hocko
