Linux-mm Archive on lore.kernel.org
* [PATCH v2 0/10] fortify oom killer even more 
@ 2016-08-25 10:03 Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 1/9] mm,oom_reaper: Reduce find_lock_task_mm() usage Michal Hocko
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michael S. Tsirkin, Michal Hocko

Hi,
I have sent this pile previously [1]. There are two changes since
then. I have dropped patch 8 [2] because Tetsuo was concerned that it
might increase the chances of depleting memory reserves. While I am not
sure this would be the case, I agree that it is not really necessary for
this series and it will fit better into changes I am planning later on.
Then I have replaced patch 9 [3] because Michael has noted [4] that
protecting vhost get_user usage is not sufficient because the driver
can call into tun so that would need some changes as well and who knows
what else might need tweaking.

Patch 1 and 2 are cleanups from Tetsuo.

Patch 3 is the core part of this series. It makes the mm of the oom victim
persistent in signal struct so that the oom killer can rely purely on this
mm rather than find_lock_task_mm which might not find any mm if all threads
passed exit_mm. Patch 4 is a follow up fix noticed during testing. I could
have folded it to the patch 3 but I guess both will be easier to review if
they are separate.
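
To illustrate the idea behind patch 3, here is a minimal userspace sketch
of "record the mm once and pin it for the signal struct life time". It is
not the kernel code: struct pinned_mm, struct victim_signal and the
model_* helpers are hypothetical stand-ins, and C11 atomics are assumed in
place of the kernel's cmpxchg and atomic_inc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct; only the pin count is modelled. */
struct pinned_mm {
	atomic_int mm_count;
};

/* Hypothetical stand-in for signal_struct with the new oom_mm field. */
struct victim_signal {
	_Atomic(struct pinned_mm *) oom_mm;
};

/*
 * Record mm as the victim's mm. Only the first caller installs the pointer
 * and takes the reference, so marking the same thread group repeatedly does
 * not leak references. Returns 1 if this call pinned the mm, 0 otherwise.
 */
static int model_mark_oom_victim(struct victim_signal *sig, struct pinned_mm *mm)
{
	struct pinned_mm *expected = NULL;

	assert(mm != NULL);	/* the caller has to provide a stable mm */
	if (atomic_compare_exchange_strong(&sig->oom_mm, &expected, mm)) {
		atomic_fetch_add(&mm->mm_count, 1);
		return 1;
	}
	return 0;
}

/* Drop the pinned reference when the signal struct goes away. */
static void model_free_signal(struct victim_signal *sig)
{
	struct pinned_mm *mm = atomic_load(&sig->oom_mm);

	if (mm)
		atomic_fetch_sub(&mm->mm_count, 1);
	atomic_store(&sig->oom_mm, NULL);
}
```

The point of the sketch is that a reader of oom_mm never has to search the
thread group for a thread which still has an mm. The pointer stays valid
until the signal struct itself is freed.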

Patch 5 is a cleanup and it removes signal_struct::oom_victims which is no
longer needed.

Patch 6 makes oom_killer_disable a full quiescent state barrier again.

Patch 7 is a pure cleanup. Again taken from Tetsuo's series [2].

Patch 8 makes sure that all kthreads (use_mm users) will detect that the mm
might have been reaped and do not trust memory returned from the page fault.

Patch 9 then allows reaping oom victim memory even when it is shared
with a kthread via use_mm, as the only problematic user is safe after
the previous patch. This leaves the only non-reapable case when the global
init shares the mm with a different process (other than vfork) which I
would consider exotic and slightly idiotic so I wouldn't lose sleep over
it.

After this series we should have guaranteed forward progress for the oom
killer invocation for mmu arches AFAICS. It would be great if this could
make it into 4.9. I would like to build on top of this and clean up the
code even more. I would like to get rid of TIF_MEMDIE in the next step
and make memory reserves access completely independent of the rest of the
OOM killer logic.

I have run this through the hammering tests mostly coming from Tetsuo
and apart from the lockup fixed by patch 4 nothing popped out.

The series is based on top of the mmotm (2016-08-23-14-42). Feedback is
more than welcome.

Thanks!

[1] http://lkml.kernel.org/r/1469734954-31247-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1469734954-31247-9-git-send-email-mhocko@kernel.org
[3] http://lkml.kernel.org/r/1469734954-31247-10-git-send-email-mhocko@kernel.org
[4] http://lkml.kernel.org/r/20160822210123.5k6zwdrkhrwjw5vv@redhat.com

Michal Hocko (6):
      oom: keep mm of the killed task available
      kernel, oom: fix potential pgd_lock deadlock from __mmdrop
      mm, oom: get rid of signal_struct::oom_victims
      oom, suspend: fix oom_killer_disable vs. pm suspend properly
      mm: make sure that kthreads will not refault oom reaped memory
      oom, oom_reaper: allow to reap mm shared by the kthreads

Tetsuo Handa (3):
      mm,oom_reaper: Reduce find_lock_task_mm() usage.
      mm,oom_reaper: Do not attempt to reap a task twice.
      mm, oom: enforce exit_oom_victim on current task


 include/linux/mm_types.h |   2 -
 include/linux/oom.h      |   9 ++-
 include/linux/sched.h    |  21 ++++++-
 kernel/exit.c            |   2 +-
 kernel/fork.c            |   7 +++
 kernel/power/process.c   |  17 +----
 mm/memory.c              |  13 ++++
 mm/oom_kill.c            | 161 ++++++++++++++++++++---------------------------
 8 files changed, 118 insertions(+), 114 deletions(-)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH v2 1/9] mm,oom_reaper: Reduce find_lock_task_mm() usage.
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 2/9] mm,oom_reaper: Do not attempt to reap a task twice Michal Hocko
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

__oom_reap_task() can be simplified a bit if it receives a valid mm from
oom_reap_task() which also uses that mm when __oom_reap_task() failed.
We can drop one find_lock_task_mm() call and also make the
__oom_reap_task() code flow easier to follow. Moreover, this will make
a later patch in the series easier to review. Pinning mm's mm_count for
a longer time is not really harmful because this will not pin much memory.

This patch doesn't introduce any functional change.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 81 ++++++++++++++++++++++++++++-------------------------------
 1 file changed, 38 insertions(+), 43 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 463cdd22d4e0..87fad956c96b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -463,12 +463,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
@@ -476,7 +474,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
-	 * __oom_reap_task		exit_mm
+	 * __oom_reap_task_mm		exit_mm
 	 *   mmget_not_zero
 	 *				  mmput
 	 *				    atomic_dec_and_test
@@ -489,22 +487,9 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	mutex_lock(&oom_lock);
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	/*
@@ -514,7 +499,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -562,8 +547,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
@@ -573,36 +556,45 @@ static bool __oom_reap_task(struct task_struct *tsk)
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
+	struct mm_struct *mm = NULL;
+	struct task_struct *p = find_lock_task_mm(tsk);
+
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might have race with exit path so consider our work done if there
+	 * is no mm.
+	 */
+	if (!p)
+		goto done;
+	mm = p->mm;
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+	if (attempts <= MAX_OOM_REAP_RETRIES)
+		goto done;
 
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
+	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+		task_pid_nr(tsk), tsk->comm);
 
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
-		}
-
-		debug_show_all_locks();
+	/*
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
+	 */
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		set_bit(MMF_OOM_REAPED, &mm->flags);
 	}
+	debug_show_all_locks();
 
+done:
 	/*
 	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
@@ -614,6 +606,9 @@ static void oom_reap_task(struct task_struct *tsk)
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
+	/* Drop a reference taken above. */
+	if (mm)
+		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
-- 
2.8.1


* [PATCH v2 2/9] mm,oom_reaper: Do not attempt to reap a task twice.
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 1/9] mm,oom_reaper: Reduce find_lock_task_mm() usage Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 3/9] oom: keep mm of the killed task available Michal Hocko
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

"mm, oom_reaper: do not attempt to reap a task twice" tried to give the
OOM reaper one more chance to retry using the MMF_OOM_NOT_REAPABLE flag.
But the usefulness of the flag is rather limited and has never actually
been shown in practice. If the flag is set, it means that the holder of
mm->mmap_sem cannot call up_write(), presumably because it is blocked in
an unkillable wait for another thread's memory allocation. But since one
of the threads sharing that mm will queue that mm immediately via the
task_will_free_mem() shortcut (otherwise, oom_badness() will select the
same mm again because the oom_score_adj value is unchanged), retrying
an MMF_OOM_NOT_REAPABLE mm is unlikely to be helpful.

Let's always set MMF_OOM_REAPED.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 -
 mm/oom_kill.c         | 15 +++------------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d7e1e783cf01..f9b0b2dd4f18 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -522,7 +522,6 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
-#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 87fad956c96b..45097f5a8f30 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -578,20 +578,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
+	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
+	set_bit(MMF_OOM_REAPED, &mm->flags);
+
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
-
-	/*
-	 * If we've already tried to reap this task in the past and
-	 * failed it probably doesn't make much sense to try yet again
-	 * so hide the mm from the oom killer so that it can move on
-	 * to another task with a different mm struct.
-	 */
-	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
-		pr_info("oom_reaper: giving up pid:%d (%s)\n",
-			task_pid_nr(tsk), tsk->comm);
-		set_bit(MMF_OOM_REAPED, &mm->flags);
-	}
 	debug_show_all_locks();
 
 done:
-- 
2.8.1


* [PATCH v2 3/9] oom: keep mm of the killed task available
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 1/9] mm,oom_reaper: Reduce find_lock_task_mm() usage Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 2/9] mm,oom_reaper: Do not attempt to reap a task twice Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 4/9] kernel, oom: fix potential pgd_lock deadlock from __mmdrop Michal Hocko
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reap_task has to call exit_oom_victim in order to make sure that the
oom victim will not block the oom killer forever. This is, however,
opening new problems (e.g. oom_killer_disable exclusion - see
74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
race")). exit_oom_victim should ideally be called only from the victim's
context.

One way to achieve this would be to rely on per mm_struct flags. We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:
do_exit
  exit_mm
    tsk->mm = NULL;
    mmput
      __mmput
    exit_oom_victim

doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time. At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.

This patch takes a different approach. We record tsk->mm in the
signal_struct and bind it to the signal struct life time for all oom
victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have
to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct. As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.

Increasing the signal_struct for something as unlikely as the oom
killer is far from ideal but this approach will make the code much more
reasonable and long term we even might want to move task->mm into the
signal_struct anyway. In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent
which would be also nice.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  2 ++
 kernel/fork.c         |  2 ++
 mm/oom_kill.c         | 51 +++++++++++++++++++--------------------------------
 3 files changed, 23 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f9b0b2dd4f18..da278b6ce44d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -802,6 +802,8 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
+					 * killed by the oom killer */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d4a866..f3b78c713211 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -243,6 +243,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
+	if (sig->oom_mm)
+		mmdrop(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 45097f5a8f30..f16ec0840a0e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -300,14 +300,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
-		bool reaped = false;
-
-		if (p) {
-			reaped = test_bit(MMF_OOM_REAPED, &p->mm->flags);
-			task_unlock(p);
-		}
-		if (reaped)
+		if (test_bit(MMF_OOM_REAPED, &task->signal->oom_mm->flags))
 			goto next;
 		goto abort;
 	}
@@ -537,11 +530,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * This task can be safely ignored because we cannot do much more
-	 * to release its memory.
-	 */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
-	/*
 	 * Drop our reference but make sure the mmput slow path is called from a
 	 * different context because we shouldn't risk we get stuck there and
 	 * put the oom_reaper out of the way.
@@ -556,20 +544,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p = find_lock_task_mm(tsk);
-
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	if (!p)
-		goto done;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
@@ -578,8 +553,6 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
-	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
@@ -595,11 +568,14 @@ static void oom_reap_task(struct task_struct *tsk)
 	tsk->oom_reaper_list = NULL;
 	exit_oom_victim(tsk);
 
+	/*
+	 * Hide this mm from OOM killer because it has been either reaped or
+	 * somebody can't call up_write(mmap_sem).
+	 */
+	set_bit(MMF_OOM_REAPED, &mm->flags);
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-	/* Drop a reference taken above. */
-	if (mm)
-		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
@@ -665,14 +641,25 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
+ *
+ * tsk->mm has to be non NULL and caller has to guarantee it is stable (either
+ * under task_lock or operate on the current).
  */
 static void mark_oom_victim(struct task_struct *tsk)
 {
+	struct mm_struct *mm = tsk->mm;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
 	atomic_inc(&tsk->signal->oom_victims);
+
+	/* oom_mm is bound to the signal struct life time. */
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+		atomic_inc(&tsk->signal->oom_mm->mm_count);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
-- 
2.8.1


* [PATCH v2 4/9] kernel, oom: fix potential pgd_lock deadlock from __mmdrop
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
                   ` (2 preceding siblings ...)
  2016-08-25 10:03 ` [PATCH v2 3/9] oom: keep mm of the killed task available Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 5/9] mm, oom: get rid of signal_struct::oom_victims Michal Hocko
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Lockdep complains that __mmdrop is not safe from the softirq context:

[   63.860469] =================================
[   63.861326] [ INFO: inconsistent lock state ]
[   63.862677] 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
[   63.864072] ---------------------------------
[   63.864072] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   63.864072] swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[   63.864072]  (pgd_lock){+.?...}, at: [<ffffffff81048762>] pgd_free+0x19/0x6b
[   63.864072] {SOFTIRQ-ON-W} state was registered at:
[   63.864072]   [<ffffffff81097da2>] __lock_acquire+0xa06/0x196e
[   63.864072]   [<ffffffff810994d8>] lock_acquire+0x139/0x1e1
[   63.864072]   [<ffffffff81625cd2>] _raw_spin_lock+0x32/0x41
[   63.864072]   [<ffffffff8104594d>] __change_page_attr_set_clr+0x2a5/0xacd
[   63.864072]   [<ffffffff810462e4>] change_page_attr_set_clr+0x16f/0x32c
[   63.864072]   [<ffffffff81046544>] set_memory_nx+0x37/0x3a
[   63.864072]   [<ffffffff81041b2c>] free_init_pages+0x9e/0xc7
[   63.864072]   [<ffffffff81d49105>] alternative_instructions+0xa2/0xb3
[   63.864072]   [<ffffffff81d4a763>] check_bugs+0xe/0x2d
[   63.864072]   [<ffffffff81d3eed0>] start_kernel+0x3ce/0x3ea
[   63.864072]   [<ffffffff81d3e2f1>] x86_64_start_reservations+0x2a/0x2c
[   63.864072]   [<ffffffff81d3e46d>] x86_64_start_kernel+0x17a/0x18d
[   63.864072] irq event stamp: 105916
[   63.864072] hardirqs last  enabled at (105916): [<ffffffff8112f5ba>] free_hot_cold_page+0x37e/0x390
[   63.864072] hardirqs last disabled at (105915): [<ffffffff8112f4fd>] free_hot_cold_page+0x2c1/0x390
[   63.864072] softirqs last  enabled at (105878): [<ffffffff81055724>] _local_bh_enable+0x42/0x44
[   63.864072] softirqs last disabled at (105879): [<ffffffff81055a6d>] irq_exit+0x6f/0xd1
[   63.864072]
[   63.864072] other info that might help us debug this:
[   63.864072]  Possible unsafe locking scenario:
[   63.864072]
[   63.864072]        CPU0
[   63.864072]        ----
[   63.864072]   lock(pgd_lock);
[   63.864072]   <Interrupt>
[   63.864072]     lock(pgd_lock);
[   63.864072]
[   63.864072]  *** DEADLOCK ***
[   63.864072]
[   63.864072] 1 lock held by swapper/1/0:
[   63.864072]  #0:  (rcu_callback){......}, at: [<ffffffff810b44f2>] rcu_process_callbacks+0x390/0x800
[   63.864072]
[   63.864072] stack backtrace:
[   63.864072] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
[   63.864072] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[   63.864072]  0000000000000000 ffff88000fb03c38 ffffffff81312df8 ffffffff8257a0d0
[   63.864072]  ffff8800069f8000 ffff88000fb03c70 ffffffff81125bc5 0000000000000004
[   63.864072]  ffff8800069f8888 ffff8800069f8000 ffffffff8109603a 0000000000000004
[   63.864072] Call Trace:
[   63.864072]  <IRQ>  [<ffffffff81312df8>] dump_stack+0x67/0x90
[   63.864072]  [<ffffffff81125bc5>] print_usage_bug.part.25+0x259/0x268
[   63.864072]  [<ffffffff8109603a>] ? print_shortest_lock_dependencies+0x180/0x180
[   63.864072]  [<ffffffff81096d33>] mark_lock+0x381/0x567
[   63.864072]  [<ffffffff81097d2f>] __lock_acquire+0x993/0x196e
[   63.864072]  [<ffffffff81048762>] ? pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff8117b8ae>] ? discard_slab+0x42/0x44
[   63.864072]  [<ffffffff8117e00d>] ? __slab_free+0x3e6/0x429
[   63.864072]  [<ffffffff810994d8>] lock_acquire+0x139/0x1e1
[   63.864072]  [<ffffffff810994d8>] ? lock_acquire+0x139/0x1e1
[   63.864072]  [<ffffffff81048762>] ? pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff81625cd2>] _raw_spin_lock+0x32/0x41
[   63.864072]  [<ffffffff81048762>] ? pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff81048762>] pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff8104d018>] __mmdrop+0x25/0xb9
[   63.864072]  [<ffffffff8104d29d>] __put_task_struct+0x103/0x11e
[   63.864072]  [<ffffffff810526a0>] delayed_put_task_struct+0x157/0x15e
[   63.864072]  [<ffffffff810b47c2>] rcu_process_callbacks+0x660/0x800
[   63.864072]  [<ffffffff81052549>] ? will_become_orphaned_pgrp+0xae/0xae
[   63.864072]  [<ffffffff8162921c>] __do_softirq+0x1ec/0x4d5
[   63.864072]  [<ffffffff81055a6d>] irq_exit+0x6f/0xd1
[   63.864072]  [<ffffffff81628d7b>] smp_apic_timer_interrupt+0x42/0x4d
[   63.864072]  [<ffffffff8162732e>] apic_timer_interrupt+0x8e/0xa0
[   63.864072]  <EOI>  [<ffffffff81021657>] ? default_idle+0x6b/0x16e
[   63.864072]  [<ffffffff81021ed2>] arch_cpu_idle+0xf/0x11
[   63.864072]  [<ffffffff8108e59b>] default_idle_call+0x32/0x34
[   63.864072]  [<ffffffff8108e7a9>] cpu_startup_entry+0x20c/0x399
[   63.864072]  [<ffffffff81034600>] start_secondary+0xfe/0x101

Moreover, a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was explicit
about pgd_lock not being taken from the irq context. This means that
__mmdrop called from free_signal_struct has to be postponed to a user
context. We already have a similar mechanism for mmput_async so we
can use it here as well. This is safe because mm_count is pinned by
mm_users.
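
The deferral itself can be sketched in userspace as follows. This is only
a model of the pattern "drop the reference here, run the teardown later
from a safe context": struct deferred_mm, the pending list and the
model_* helpers are hypothetical, the list is single threaded, and a
plain function call stands in for the workqueue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct with an embedded deferral link. */
struct deferred_mm {
	atomic_int mm_count;
	struct deferred_mm *next_pending;
	int destroyed;			/* set once the teardown has run */
};

/* Objects whose last reference is gone but whose teardown is pending. */
static struct deferred_mm *pending_head;

/*
 * Drop one reference. The teardown is never run from here, the object is
 * only queued when the count hits zero. atomic_fetch_sub returns the old
 * value, so 1 means this call dropped the last reference.
 */
static void model_mmdrop_async(struct deferred_mm *mm)
{
	assert(mm != NULL);
	if (atomic_fetch_sub(&mm->mm_count, 1) == 1) {
		mm->next_pending = pending_head;
		pending_head = mm;
	}
}

/* Models the worker: runs every queued teardown, returns how many ran. */
static int model_run_pending(void)
{
	int done = 0;

	while (pending_head) {
		struct deferred_mm *mm = pending_head;

		pending_head = mm->next_pending;
		mm->destroyed = 1;	/* stands in for the real teardown */
		done++;
	}
	return done;
}
```

The context which drops the last reference only pays for the queueing.
Anything that needs a lock unsafe in that context runs later from the
worker.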

This fixes a bug introduced by "oom: keep mm of the killed task available".

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm_types.h |  2 --
 include/linux/sched.h    | 14 ++++++++++++++
 kernel/fork.c            |  6 +++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 903200f4ec41..4a8acedf4b7d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -515,9 +515,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
-#ifdef CONFIG_MMU
 	struct work_struct async_put_work;
-#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index da278b6ce44d..cccb575dc242 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2829,6 +2829,20 @@ static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
 
+static inline void mmdrop_async_fn(struct work_struct *work)
+{
+	struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
+	__mmdrop(mm);
+}
+
+static inline void mmdrop_async(struct mm_struct *mm)
+{
+	if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
+		INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
+		schedule_work(&mm->async_put_work);
+	}
+}
+
 static inline bool mmget_not_zero(struct mm_struct *mm)
 {
 	return atomic_inc_not_zero(&mm->mm_users);
diff --git a/kernel/fork.c b/kernel/fork.c
index f3b78c713211..136a2c6784cb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -243,8 +243,12 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
+	/*
+	 * __mmdrop is not safe to call from softirq context on x86 due to
+	 * pgd_dtor so postpone it to the async context
+	 */
 	if (sig->oom_mm)
-		mmdrop(sig->oom_mm);
+		mmdrop_async(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
-- 
2.8.1


* [PATCH v2 5/9] mm, oom: get rid of signal_struct::oom_victims
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
                   ` (3 preceding siblings ...)
  2016-08-25 10:03 ` [PATCH v2 4/9] kernel, oom: fix potential pgd_lock deadlock from __mmdrop Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 6/9] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

After "oom: keep mm of the killed task available" we can safely
detect an oom victim by checking task->signal->oom_mm, so we do not need
the signal_struct counter anymore. Let's get rid of it.

This alone wouldn't be sufficient for nommu archs because exit_oom_victim
doesn't hide the process from the oom killer anymore. We can, however,
mark the mm with a MMF flag in __mmput. We can reuse MMF_OOM_REAPED and
rename it to a more generic MMF_OOM_SKIP.
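
A simplified userspace sketch of the resulting decision follows. It is
not the kernel code: struct skip_mm, struct skip_signal, enum scan_result
and the model_* helpers are hypothetical, and the sysrq special case is
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MMF_OOM_SKIP 21	/* bit number, mirrors the flag in this patch */

/* Hypothetical stand-ins for mm_struct and signal_struct. */
struct skip_mm {
	unsigned long flags;
};

struct skip_signal {
	struct skip_mm *oom_mm;	/* non-NULL once the group became a victim */
};

enum scan_result {
	SCAN_CONSIDER,	/* not a victim, evaluate as a regular candidate */
	SCAN_NEXT,	/* victim which will not free more memory, skip it */
	SCAN_ABORT,	/* victim still expected to free memory, wait */
};

/* A thread group is a victim as soon as an mm has been recorded for it. */
static bool model_is_oom_victim(const struct skip_signal *sig)
{
	return sig->oom_mm != NULL;
}

static enum scan_result model_evaluate(const struct skip_signal *sig)
{
	assert(sig != NULL);
	if (!model_is_oom_victim(sig))
		return SCAN_CONSIDER;
	if (sig->oom_mm->flags & (1UL << MODEL_MMF_OOM_SKIP))
		return SCAN_NEXT;
	return SCAN_ABORT;
}
```

The pointer answers "is this a victim" and the flag answers "is it still
worth waiting for", so no separate counter is needed in the sketch.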

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h   |  5 +++++
 include/linux/sched.h |  3 +--
 kernel/fork.c         |  1 +
 mm/oom_kill.c         | 17 +++++++----------
 4 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 17946e5121b6..b61357d07170 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -58,6 +58,11 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
+static inline bool tsk_is_oom_victim(struct task_struct * tsk)
+{
+	return tsk->signal->oom_mm;
+}
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cccb575dc242..eda579f3283a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,7 +521,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
-#define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
@@ -669,7 +669,6 @@ struct signal_struct {
 	atomic_t		sigcnt;
 	atomic_t		live;
 	int			nr_threads;
-	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
 	struct list_head	thread_head;
 
 	wait_queue_head_t	wait_chldexit;	/* for wait4() */
diff --git a/kernel/fork.c b/kernel/fork.c
index 136a2c6784cb..64624fb42f96 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -725,6 +725,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	mmdrop(mm);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f16ec0840a0e..e2a2c35dd493 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -186,7 +186,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
@@ -296,11 +296,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_REAPED because chances that it would release
+	 * the task has MMF_OOM_SKIP because chances that it would release
 	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		if (test_bit(MMF_OOM_REAPED, &task->signal->oom_mm->flags))
+	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
+		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
 			goto next;
 		goto abort;
 	}
@@ -572,7 +572,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -654,8 +654,6 @@ static void mark_oom_victim(struct task_struct *tsk)
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 
-	atomic_inc(&tsk->signal->oom_victims);
-
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		atomic_inc(&tsk->signal->oom_mm->mm_count);
@@ -677,7 +675,6 @@ void exit_oom_victim(struct task_struct *tsk)
 {
 	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_dec(&tsk->signal->oom_victims);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -769,7 +766,7 @@ static bool task_will_free_mem(struct task_struct *task)
 	 * This task has already been drained by the oom reaper so there are
 	 * only small chances it will free some more
 	 */
-	if (test_bit(MMF_OOM_REAPED, &mm->flags))
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
 		return false;
 
 	if (atomic_read(&mm->mm_users) <= 1)
@@ -906,7 +903,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
-			set_bit(MMF_OOM_REAPED, &mm->flags);
+			set_bit(MMF_OOM_SKIP, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH v2 6/9] oom, suspend: fix oom_killer_disable vs. pm suspend properly
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
                   ` (4 preceding siblings ...)
  2016-08-25 10:03 ` [PATCH v2 5/9] mm, oom: get rid of signal_struct::oom_victims Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 7/9] mm, oom: enforce exit_oom_victim on current task Michal Hocko
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
race") has worked around an existing race between oom_killer_disable
and oom_reaper by adding another round of try_to_freeze_tasks after
the oom killer was disabled. This was the easiest thing to do for
a late 4.7 fix. Let's fix it properly now.

After "oom: keep mm of the killed task available" we no longer
have to call exit_oom_victim from the oom reaper because we have a stable
mm available and hide the reaped mm with the MMF_OOM_SKIP flag. So
let's remove the exit_oom_victim call from the oom reaper; the race
described in the above commit then no longer exists.

Unfortunately this alone is not sufficient for the oom_killer_disable
usecase because now we do not have any reliable way to reach
exit_oom_victim (the victim might get stuck on its way to exit for an
unbounded amount of time). The OOM killer can cope with that by checking
mm flags and moving on to another victim but we cannot do the same
for oom_killer_disable as we would lose the guarantee of no further
interference of the victim with the rest of the system. What we can do
instead is to cap the maximum time the oom_killer_disable waits for
victims. The only current user of this function (pm suspend) already has
a concept of timeout for back off so we can reuse the same value there.
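
As a rough userspace sketch of the bounded wait described above (not the
kernel implementation): oom_state_model and model_oom_killer_disable are
hypothetical names, and wait_ret stands in for the value returned by
wait_event_interruptible_timeout(), which is positive when the condition
became true, 0 when the timeout expired with the condition still false,
and negative when the wait was interrupted by a signal.

```c
#include <stdbool.h>

/* Userspace model only; stands in for the oom_killer_disabled global. */
struct oom_state_model {
	bool killer_disabled;
};

/*
 * Mirrors the control flow of the new oom_killer_disable(): disable
 * first, then wait for the victims, and roll back if the wait did not
 * succeed within the timeout.
 */
static bool model_oom_killer_disable(struct oom_state_model *st,
				     long wait_ret)
{
	st->killer_disabled = true;

	if (wait_ret <= 0) {
		/* Timed out or interrupted: re-enable and report failure. */
		st->killer_disabled = false;
		return false;
	}
	return true;
}
```

The rollback on failure matters for the caller: freeze_processes() turns a
false return into -EBUSY and thaws everything, so the OOM killer must be
usable again at that point.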

Let's drop set_freezable for the oom_reaper kthread because it is no
longer needed as the reaper doesn't wake or thaw any processes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h    |  2 +-
 kernel/power/process.c | 17 +++--------------
 mm/oom_kill.c          | 40 ++++++++++++++++++++--------------------
 3 files changed, 24 insertions(+), 35 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index b61357d07170..0f1b9da108e4 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -74,7 +74,7 @@ extern void exit_oom_victim(struct task_struct *tsk);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disable(void);
+extern bool oom_killer_disable(signed long timeout);
 extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 0c2ee9761d57..2456f10c7326 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -141,23 +141,12 @@ int freeze_processes(void)
 	/*
 	 * Now that the whole userspace is frozen we need to disbale
 	 * the OOM killer to disallow any further interference with
-	 * killable tasks.
+	 * killable tasks. There is no guarantee oom victims will ever
+	 * reach a point where they go away, so wait with a timeout.
 	 */
-	if (!error && !oom_killer_disable())
+	if (!error && !oom_killer_disable(msecs_to_jiffies(freeze_timeout_msecs)))
 		error = -EBUSY;
 
-	/*
-	 * There is a hard to fix race between oom_reaper kernel thread
-	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
-	 * before the victim reaches exit_mm so try to freeze all the tasks
-	 * again and catch such a left over task.
-	 */
-	if (!error) {
-		pr_info("Double checking all user space processes after OOM killer disable... ");
-		error = try_to_freeze_tasks(true);
-		pr_cont("\n");
-	}
-
 	if (error)
 		thaw_processes();
 	return error;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e2a2c35dd493..895a51fe8e18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -559,14 +559,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	debug_show_all_locks();
 
 done:
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
 
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
@@ -580,8 +573,6 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
-	set_freezable();
-
 	while (true) {
 		struct task_struct *tsk = NULL;
 
@@ -681,10 +672,20 @@ void exit_oom_victim(struct task_struct *tsk)
 }
 
 /**
+ * oom_killer_enable - enable OOM killer
+ */
+void oom_killer_enable(void)
+{
+	oom_killer_disabled = false;
+}
+
+/**
  * oom_killer_disable - disable OOM killer
+ * @timeout: maximum timeout to wait for oom victims in jiffies
  *
  * Forces all page allocations to fail rather than trigger OOM killer.
- * Will block and wait until all OOM victims are killed.
+ * Will block and wait until all OOM victims are killed or the given
+ * timeout expires.
  *
  * The function cannot be called when there are runnable user tasks because
  * the userspace would see unexpected allocation failures as a result. Any
@@ -693,8 +694,10 @@ void exit_oom_victim(struct task_struct *tsk)
  * Returns true if successful and false if the OOM killer cannot be
  * disabled.
  */
-bool oom_killer_disable(void)
+bool oom_killer_disable(signed long timeout)
 {
+	signed long ret;
+
 	/*
 	 * Make sure to not race with an ongoing OOM killer. Check that the
 	 * current is not killed (possibly due to sharing the victim's memory).
@@ -704,19 +707,16 @@ bool oom_killer_disable(void)
 	oom_killer_disabled = true;
 	mutex_unlock(&oom_lock);
 
-	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
+	ret = wait_event_interruptible_timeout(oom_victims_wait,
+			!atomic_read(&oom_victims), timeout);
+	if (ret <= 0) {
+		oom_killer_enable();
+		return false;
+	}
 
 	return true;
 }
 
-/**
- * oom_killer_enable - enable OOM killer
- */
-void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
-
 static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
-- 
2.8.1

* [PATCH v2 7/9] mm, oom: enforce exit_oom_victim on current task
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
                   ` (5 preceding siblings ...)
  2016-08-25 10:03 ` [PATCH v2 6/9] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 8/9] mm: make sure that kthreads will not refault oom reaped memory Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 9/9] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

There are no users of exit_oom_victim on a !current task anymore, so
enforce the API to always work on current.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 2 +-
 kernel/exit.c       | 2 +-
 mm/oom_kill.c       | 5 ++---
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 0f1b9da108e4..b4e36e92bc87 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -69,7 +69,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index bbdef62d6e3b..c36f8e0ab66d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 895a51fe8e18..3b990544db6d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -662,10 +662,9 @@ static void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
-- 
2.8.1

* [PATCH v2 8/9] mm: make sure that kthreads will not refault oom reaped memory
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
                   ` (6 preceding siblings ...)
  2016-08-25 10:03 ` [PATCH v2 7/9] mm, oom: enforce exit_oom_victim on current task Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  2016-08-25 10:03 ` [PATCH v2 9/9] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

There are only a few use_mm() users in the kernel right now. Most
of them write to the target memory but the vhost driver relies on
copy_from_user/get_user from a kernel thread context. This makes it
impossible to reap the memory of an oom victim which shares the mm with
the vhost kernel thread because it could see a zero page unexpectedly
and theoretically make an incorrect decision visible outside of the
killed task context.

To quote Michael S. Tsirkin:
: Getting an error from __get_user and friends is handled gracefully.
: Getting zero instead of a real value will cause userspace
: memory corruption.

The vhost kernel thread is bound to an open fd of the vhost device which
is not tied to the mm owner life cycle in general. The device fd can be
inherited or passed over to another process which means that we really
have to be careful about unexpected memory corruption because unlike for
normal oom victims the result will be visible outside of the oom victim
context.

Make sure that no kthread context (users of use_mm) can ever see
corrupted data because of the oom reaper and hook into the page fault
path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the
flag before it starts unmapping the address space while the flag is
checked after the page fault has been handled. If the flag is set
then SIGBUS is triggered so any g-u-p user will get an error code.

Regular tasks do not need this protection because all which share the mm
are killed when the mm is reaped and so the corruption will not outlive
them.

This patch shouldn't have any visible effect at this moment because the
OOM killer doesn't invoke oom reaper for tasks with mm shared with
kthreads yet.
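
The post-fault check can be sketched in userspace C as follows. This is a
simplified model under stated assumptions, not the mm/memory.c code: the
MODEL_FAULT_* values and model_fault_fixup are hypothetical, and the two
booleans stand in for (current->flags & PF_KTHREAD) and
test_bit(MMF_UNSTABLE, ...).

```c
#include <stdbool.h>

/* Hypothetical result bits; the real VM_FAULT_* values differ. */
#define MODEL_FAULT_OOM		0x1u
#define MODEL_FAULT_SIGBUS	0x2u
#define MODEL_FAULT_ERROR	(MODEL_FAULT_OOM | MODEL_FAULT_SIGBUS)

/*
 * Applied after the fault has been handled: a kthread that faulted
 * successfully on an mm already marked unstable gets SIGBUS instead,
 * so it sees an error rather than a zero page.
 */
static unsigned int model_fault_fixup(bool is_kthread, bool mm_unstable,
				      unsigned int ret)
{
	if (is_kthread && !(ret & MODEL_FAULT_ERROR) && mm_unstable)
		return MODEL_FAULT_SIGBUS;
	return ret;
}
```

Existing error results are left untouched, and regular tasks never take
the override, which matches the reasoning that only use_mm() users can
outlive the victim.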

Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/memory.c           | 13 +++++++++++++
 mm/oom_kill.c         |  8 ++++++++
 3 files changed, 22 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eda579f3283a..63acaf9cc51c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -522,6 +522,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
+#define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/memory.c b/mm/memory.c
index 83be99d9d8a1..020226b4114b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3656,6 +3656,19 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                         mem_cgroup_oom_synchronize(false);
 	}
 
+	/*
+	 * This mm has been already reaped by the oom reaper and so the
+	 * refault cannot be trusted in general. Anonymous refaults, e.g.,
+	 * would lose data and give a zero page instead. This is especially
+	 * a problem for use_mm() because regular tasks will just die and
+	 * the corrupted data will not be visible anywhere while a kthread
+	 * will outlive the oom victim and potentially propagate the data
+	 * further.
+	 */
+	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+		ret = VM_FAULT_SIGBUS;
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3b990544db6d..5a3ba96c8338 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -495,6 +495,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
 	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (is_vm_hugetlb_page(vma))
-- 
2.8.1

* [PATCH v2 9/9] oom, oom_reaper: allow to reap mm shared by the kthreads
  2016-08-25 10:03 [PATCH v2 0/10] fortify oom killer even more Michal Hocko
                   ` (7 preceding siblings ...)
  2016-08-25 10:03 ` [PATCH v2 8/9] mm: make sure that kthreads will not refault oom reaped memory Michal Hocko
@ 2016-08-25 10:03 ` Michal Hocko
  8 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2016-08-25 10:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The oom reaper was skipped for an mm which is shared with a kernel thread
(aka use_mm()). The primary concern was that such a kthread might want
to read from the userspace memory and see a zero page as a result of the
oom reaper action. This is no longer a problem after "mm: make sure that
kthreads will not refault oom reaped memory" because any attempt to
fault in when MMF_UNSTABLE is set will result in SIGBUS and so the
target user should see an error. This means that we can finally allow
the oom reaper to also handle tasks which share their mm with kthreads.
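
The resulting per-process decision in the oom_kill_process() loop can be
modelled roughly like this. It is a userspace sketch, not the kernel loop:
proc_model, share_action and model_classify are hypothetical names, and
the shares_mm field is an assumption about the check that precedes the
first "continue" in the hunk context below.

```c
#include <stdbool.h>

/* Userspace model of the properties the loop looks at. */
struct proc_model {
	bool shares_mm;		/* assumed: process uses the victim's mm */
	bool same_thread_group;	/* belongs to the victim itself */
	bool is_global_init;
	bool is_kthread;	/* PF_KTHREAD, i.e. a use_mm() user */
};

enum share_action { SHARE_SKIP, SHARE_PIN, SHARE_KILL };

/*
 * After this patch only global init pins the mm and disables reaping;
 * kthreads are skipped without clearing can_oom_reap.
 */
static enum share_action model_classify(const struct proc_model *p,
					bool *can_oom_reap)
{
	if (!p->shares_mm || p->same_thread_group)
		return SHARE_SKIP;
	if (p->is_global_init) {
		*can_oom_reap = false;	/* mm also gets MMF_OOM_SKIP */
		return SHARE_PIN;
	}
	if (p->is_kthread)
		return SHARE_SKIP;	/* not killed, but reaping is allowed */
	return SHARE_KILL;		/* receives SIGKILL */
}
```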

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5a3ba96c8338..10f686969fc4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -902,13 +902,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used. Hide the mm from the oom
-			 * killer to guarantee OOM forward progress.
-			 */
+		if (is_global_init(p)) {
 			can_oom_reap = false;
 			set_bit(MMF_OOM_SKIP, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
@@ -916,6 +910,12 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 					task_pid_nr(p), p->comm);
 			continue;
 		}
+		/*
+		 * No use_mm() user needs to read from the userspace so we are
+		 * ok to reap it.
+		 */
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1
