linux-mm.kvack.org archive mirror
* [RFC PATCH 0/6] fortify oom killer even more
@ 2016-07-01  9:26 Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 1/6] oom: keep mm of the killed task available Michal Hocko
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Michal Hocko @ 2016-07-01  9:26 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michael S. Tsirkin, Michal Hocko

Hi,
I am sending this pile as an RFC in the hope that it makes a good
foundation for plugging the remaining holes which could lead to oom
lockups for CONFIG_MMU.

There are two main parts: patches 1-4 and patches 5-6. The first pile
focuses on moving decisions about oom victims to the mm_struct,
especially the part where an oom victim has already been noticed and we
decide whether to select a new one. Patch 1 remembers the mm at the time
the oom victim is selected, and it stays stable for the rest of the
process group's life time. This allows some simplifications.

The latter part is about the kthread vs. oom_reaper interaction. It seems
that the only use_mm() user which needs fixing is vhost, and that is
fixable. Then we can remove the kthread restriction, so basically every
oom victim will be reapable now (well, except the weird cases where the
mm is shared with init, but I consider those uninteresting).

I haven't tested this properly yet. I will be mostly offline next week
but definitely plan to test it later on. Right now I would appreciate
feedback/review. If this looks OK then I would like to target it for 4.9.

The series is based on top of the current mmotm (2016-06-24-15-53) +
http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org

Thanks!

Michal Hocko (6):
      oom: keep mm of the killed task available
      oom, suspend: fix oom_killer_disable vs. pm suspend properly
      exit, oom: postpone exit_oom_victim to later
      oom, oom_reaper: consider mmget_not_zero as a failure
      vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
      oom, oom_reaper: allow to reap mm shared by the kthreads

 drivers/vhost/scsi.c    |   2 +-
 drivers/vhost/vhost.c   |  18 +++----
 include/linux/oom.h     |   2 +-
 include/linux/sched.h   |   3 ++
 include/linux/uaccess.h |  22 +++++++++
 include/linux/uio.h     |  10 ++++
 kernel/exit.c           |   5 +-
 kernel/fork.c           |   2 +
 kernel/power/process.c  |  17 ++-----
 mm/mmu_context.c        |   6 +++
 mm/oom_kill.c           | 127 ++++++++++++++++++++++++------------------------
 11 files changed, 124 insertions(+), 90 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 1/6] oom: keep mm of the killed task available
  2016-07-01  9:26 [RFC PATCH 0/6] fortify oom killer even more Michal Hocko
@ 2016-07-01  9:26 ` Michal Hocko
  2016-07-03  2:45   ` Tetsuo Handa
  2016-07-01  9:26 ` [RFC PATCH 2/6] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2016-07-01  9:26 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reap_task has to call exit_oom_victim in order to make sure that the
oom victim will never block the oom killer forever. This, however,
opens new problems (e.g. oom_killer_disable exclusion - see
74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
race")). Ideally, exit_oom_victim should only be called from the
victim's context.

One way to achieve this would be to rely on per mm_struct flags. We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:
do_exit
  exit_mm
    tsk->mm = NULL;
    mmput
      __mmput
    exit_oom_victim

doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time. At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.

This patch takes a different approach. We remember tsk->mm in the
signal_struct and bind it to the signal struct life time for all oom
victims. __oom_reap_task as well as oom_scan_process_thread no longer
have to rely on find_lock_task_mm and will have a reliable reference to
the mm struct. As a result all the oom specific communication inside the
OOM killer can be done via tsk->signal->oom_mm, and exit_oom_victim can
be dropped from __oom_reap_task. MMF_OOM_NOT_REAPABLE is trivial to
implement as well because we just need to return OOM_SCAN_SELECT when we
see the flag.

Increasing the size of the signal_struct for something as unlikely as
the oom killer is far from ideal, but this approach will make the code
much more reasonable, and in the long term we might even want to move
task->mm into the signal_struct anyway. As a next step we might want to
make the oom killer exclusion and access to memory reserves completely
independent, which would also be nice.
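
The "publish once and pin" trick mark_oom_victim() uses for
signal->oom_mm can be illustrated with a small userspace sketch using
C11 atomics. The struct and function names below are illustrative
stand-ins, not kernel code; the point is just the first-caller-wins
cmpxchg followed by exactly one reference, paired with a single drop
when the signal struct goes away:

```c
#include <stdatomic.h>
#include <stddef.h>

struct mm_sketch { atomic_int mm_count; };                    /* stand-in for mm_struct */
struct signal_sketch { _Atomic(struct mm_sketch *) oom_mm; }; /* stand-in for signal_struct */

static void mark_victim(struct signal_sketch *sig, struct mm_sketch *mm)
{
	struct mm_sketch *expected = NULL;

	/* The first caller publishes the pointer and pins the mm; later
	 * callers from the same thread group see it already set and take
	 * no extra reference, so exactly one pin exists per signal. */
	if (atomic_compare_exchange_strong(&sig->oom_mm, &expected, mm))
		atomic_fetch_add(&mm->mm_count, 1);
}

static void free_signal_sketch(struct signal_sketch *sig)
{
	struct mm_sketch *mm = atomic_load(&sig->oom_mm);

	/* Pairs with the single pin above, like mmdrop() in
	 * free_signal_struct(). */
	if (mm)
		atomic_fetch_sub(&mm->mm_count, 1);
}
```

Calling mark_victim() any number of times raises mm_count by exactly one,
which is why binding oom_mm to the signal struct life time is safe.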

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  2 ++
 kernel/fork.c         |  2 ++
 mm/oom_kill.c         | 67 +++++++++++++++++++++------------------------------
 3 files changed, 31 insertions(+), 40 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d81a1eb974a..befdcc1cde3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,8 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
+					 * killed by the oom killer */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 452fc864f2f6..2bd3cc73d103 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -245,6 +245,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
+	if (sig->oom_mm)
+		mmdrop(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275df822..4ea4a649822d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves unless
 	 * the task has MMF_OOM_REAPED because chances that it would release
 	 * any memory is quite low.
+	 * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time
+	 * so let it try again.
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
+		struct mm_struct *mm = task->signal->oom_mm;
 		enum oom_scan_t ret = OOM_SCAN_ABORT;
 
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
+		if (test_bit(MMF_OOM_REAPED, &mm->flags))
+			ret = OOM_SCAN_CONTINUE;
+		else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags))
+			ret = OOM_SCAN_SELECT;
 
 		return ret;
 	}
@@ -457,7 +458,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
@@ -478,22 +478,10 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	mutex_lock(&oom_lock);
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
-
+	mm = tsk->signal->oom_mm;
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	/*
@@ -503,7 +491,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -551,8 +539,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
@@ -568,7 +554,7 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+		struct mm_struct *mm;
 
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
@@ -579,27 +565,17 @@ static void oom_reap_task(struct task_struct *tsk)
 		 * so hide the mm from the oom killer so that it can move on
 		 * to another task with a different mm struct.
 		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
+		mm = tsk->signal->oom_mm;
+		if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+			pr_info("oom_reaper: giving up pid:%d (%s)\n",
+					task_pid_nr(tsk), tsk->comm);
+			set_bit(MMF_OOM_REAPED, &mm->flags);
 		}
 
 		debug_show_all_locks();
 	}
 
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -664,14 +640,25 @@ subsys_initcall(oom_init)
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
+ *
+ * tsk->mm has to be non-NULL and the caller has to guarantee it is stable
+ * (either under task_lock or by operating on current).
  */
 void mark_oom_victim(struct task_struct *tsk)
 {
+	struct mm_struct *mm = tsk->mm;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
 	atomic_inc(&tsk->signal->oom_victims);
+
+	/* oom_mm is bound to the signal struct life time. */
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+		atomic_inc(&tsk->signal->oom_mm->mm_count);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
-- 
2.8.1



* [RFC PATCH 2/6] oom, suspend: fix oom_killer_disable vs. pm suspend properly
  2016-07-01  9:26 [RFC PATCH 0/6] fortify oom killer even more Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 1/6] oom: keep mm of the killed task available Michal Hocko
@ 2016-07-01  9:26 ` Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 3/6] exit, oom: postpone exit_oom_victim to later Michal Hocko
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2016-07-01  9:26 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
race") worked around an existing race between oom_killer_disable
and oom_reaper by adding another round of try_to_freeze_tasks after
the oom killer was disabled. This was the easiest thing to do for
a late 4.7 fix. Let's fix it properly now.

After "oom: keep mm of the killed task available" we no longer call
exit_oom_victim from the oom reaper, so the race described in the above
commit doesn't exist anymore. Unfortunately this alone is not sufficient
for the oom_killer_disable usecase, because now we do not have any
reliable way to reach exit_oom_victim (the victim might get stuck on its
way to exit for an unbounded amount of time). The OOM killer can cope
with that by checking mm flags and moving on to another victim, but we
cannot do the same for oom_killer_disable, as we would lose the guarantee
of no further interference by the victim with the rest of the system.
What we can do instead is to cap the maximum time oom_killer_disable
waits for victims. The only current user of this function (pm suspend)
already has a concept of a timeout for back off, so we can reuse the
same value there.

Let's drop set_freezable for the oom_reaper kthread because it is no
longer needed as the reaper doesn't wake or thaw any processes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h    |  2 +-
 kernel/power/process.c | 17 +++--------------
 mm/oom_kill.c          | 33 ++++++++++++++++++++-------------
 3 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457ee3a8..eb44374a3f32 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -102,7 +102,7 @@ extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
 extern bool oom_killer_disabled;
-extern bool oom_killer_disable(void);
+extern bool oom_killer_disable(signed long timeout);
 extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 0c2ee9761d57..2456f10c7326 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -141,23 +141,12 @@ int freeze_processes(void)
 	/*
 	 * Now that the whole userspace is frozen we need to disbale
 	 * the OOM killer to disallow any further interference with
-	 * killable tasks.
+	 * killable tasks. There is no guarantee oom victims will ever
+	 * reach a point where they go away, so we have to wait with a timeout.
 	 */
-	if (!error && !oom_killer_disable())
+	if (!error && !oom_killer_disable(msecs_to_jiffies(freeze_timeout_msecs)))
 		error = -EBUSY;
 
-	/*
-	 * There is a hard to fix race between oom_reaper kernel thread
-	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
-	 * before the victim reaches exit_mm so try to freeze all the tasks
-	 * again and catch such a left over task.
-	 */
-	if (!error) {
-		pr_info("Double checking all user space processes after OOM killer disable... ");
-		error = try_to_freeze_tasks(true);
-		pr_cont("\n");
-	}
-
 	if (error)
 		thaw_processes();
 	return error;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4ea4a649822d..4ac089cba353 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -583,8 +583,6 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
-	set_freezable();
-
 	while (true) {
 		struct task_struct *tsk = NULL;
 
@@ -683,10 +681,20 @@ void exit_oom_victim(struct task_struct *tsk)
 }
 
 /**
+ * oom_killer_enable - enable OOM killer
+ */
+void oom_killer_enable(void)
+{
+	oom_killer_disabled = false;
+}
+
+/**
  * oom_killer_disable - disable OOM killer
+ * @timeout: maximum timeout to wait for oom victims in jiffies
  *
  * Forces all page allocations to fail rather than trigger OOM killer.
- * Will block and wait until all OOM victims are killed.
+ * Will block and wait until all OOM victims are killed or the given
+ * timeout expires.
  *
  * The function cannot be called when there are runnable user tasks because
  * the userspace would see unexpected allocation failures as a result. Any
@@ -695,8 +703,10 @@ void exit_oom_victim(struct task_struct *tsk)
  * Returns true if successful and false if the OOM killer cannot be
  * disabled.
  */
-bool oom_killer_disable(void)
+bool oom_killer_disable(signed long timeout)
 {
+	signed long ret;
+
 	/*
 	 * Make sure to not race with an ongoing OOM killer. Check that the
 	 * current is not killed (possibly due to sharing the victim's memory).
@@ -706,19 +716,16 @@ bool oom_killer_disable(void)
 	oom_killer_disabled = true;
 	mutex_unlock(&oom_lock);
 
-	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
+	ret = wait_event_interruptible_timeout(oom_victims_wait,
+			!atomic_read(&oom_victims), timeout);
+	if (ret <= 0) {
+		oom_killer_enable();
+		return false;
+	}
 
 	return true;
 }
 
-/**
- * oom_killer_enable - enable OOM killer
- */
-void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
-
 static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
-- 
2.8.1



* [RFC PATCH 3/6] exit, oom: postpone exit_oom_victim to later
  2016-07-01  9:26 [RFC PATCH 0/6] fortify oom killer even more Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 1/6] oom: keep mm of the killed task available Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 2/6] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
@ 2016-07-01  9:26 ` Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 4/6] oom, oom_reaper: consider mmget_not_zero as a failure Michal Hocko
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2016-07-01  9:26 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

exit_oom_victim was called after mmput because it is expected that the
address space of the victim will have been released by that time, and
there is no reason to hold off the oom killer from selecting another
task should that be insufficient to handle the oom situation. In order
to catch post-exit_mm() allocations we used to check for PF_EXITING, but
this got removed by 6a618957ad17 ("mm: oom_kill: don't ignore oom score
on exiting tasks") because the check was lockup prone.

It seems that we have all the needed pieces ready now and can finally
fix this properly (at least for CONFIG_MMU cases where we have the
oom_reaper).  Since "oom: keep mm of the killed task available" we have
a reliable way to ignore oom victims which are no longer interesting
because they either were reaped and do not sit on a lot of memory, or
they are not reapable for some reason and it is safer to ignore them
and move on to another victim. That means we can safely postpone
exit_oom_victim closer to the final schedule.

There are possible advantages to this because we reduce the chances
of further interference by the oom victim with the rest of the system
after oom_killer_disable(). Strictly speaking this is possible right
now because there are indeed allocations possible past exit_mm() and
who knows whether some of them can trigger IO. I haven't seen this in
practice though.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/exit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e1356e6bb..a7260c05f18c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,8 +434,6 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
@@ -822,6 +820,9 @@ void do_exit(long code)
 	smp_mb();
 	raw_spin_unlock_wait(&tsk->pi_lock);
 
+	if (test_thread_flag(TIF_MEMDIE))
+		exit_oom_victim(tsk);
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
-- 
2.8.1



* [RFC PATCH 4/6] oom, oom_reaper: consider mmget_not_zero as a failure
  2016-07-01  9:26 [RFC PATCH 0/6] fortify oom killer even more Michal Hocko
                   ` (2 preceding siblings ...)
  2016-07-01  9:26 ` [RFC PATCH 3/6] exit, oom: postpone exit_oom_victim to later Michal Hocko
@ 2016-07-01  9:26 ` Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
  2016-07-01  9:26 ` [RFC PATCH 6/6] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
  5 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2016-07-01  9:26 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

mmget_not_zero failing means that we are racing with mmput->__mmput,
and we currently interpret this as a success because we believe that
__mmput will release the address space. This is not guaranteed though,
because at least exit_aio might wait on IO and it is not entirely clear
whether it will terminate in a bounded amount of time. It is hard to
tell what else is lurking there.

This patch makes this path more conservative: we report a failure,
which will lead to setting MMF_OOM_NOT_REAPABLE and then MMF_OOM_REAPED
if this state turns out to be permanent.
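
The race being made conservative here is the classic "increment the user
count only if it has not already dropped to zero" pattern. A userspace
C11 sketch of that pattern (stand-in types, not the kernel's
implementation) shows why a false return means teardown has already
started and must not be resurrected:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct mm_sketch { atomic_int mm_users; };	/* stand-in for mm_struct */

static bool mmget_not_zero_sketch(struct mm_sketch *mm)
{
	int users = atomic_load(&mm->mm_users);

	while (users != 0) {
		/* The CAS only succeeds if nobody dropped the count to
		 * zero between the load and here; on failure `users` is
		 * refreshed and we retry or give up. */
		if (atomic_compare_exchange_weak(&mm->mm_users, &users,
						 users + 1))
			return true;	/* reference taken */
	}
	return false;	/* already racing with __mmput: report failure */
}
```

With this patch, the false case no longer counts as "someone else will
free the memory" but as a reap failure, which feeds the
MMF_OOM_NOT_REAPABLE / MMF_OOM_REAPED retry logic.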

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4ac089cba353..b2210b6c38ba 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -460,7 +460,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = NULL;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
-	bool ret = true;
+	bool ret = false;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
@@ -479,10 +479,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	mutex_lock(&oom_lock);
 
 	mm = tsk->signal->oom_mm;
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
+	if (!down_read_trylock(&mm->mmap_sem))
 		goto unlock_oom;
-	}
 
 	/*
 	 * increase mm_users only after we know we will reap something so
@@ -494,6 +492,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		goto unlock_oom;
 	}
 
+	ret = true;
 	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (is_vm_hugetlb_page(vma))
-- 
2.8.1



* [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-01  9:26 [RFC PATCH 0/6] fortify oom killer even more Michal Hocko
                   ` (3 preceding siblings ...)
  2016-07-01  9:26 ` [RFC PATCH 4/6] oom, oom_reaper: consider mmget_not_zero as a failure Michal Hocko
@ 2016-07-01  9:26 ` Michal Hocko
  2016-07-03 13:47   ` Oleg Nesterov
  2016-07-01  9:26 ` [RFC PATCH 6/6] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
  5 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2016-07-01  9:26 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko, Michael S. Tsirkin

From: Michal Hocko <mhocko@suse.com>

The vhost driver relies on copy_from_user/get_user from a kernel thread.
This makes it impossible to reap the memory of an oom victim which
shares its mm with the vhost kernel thread, because the thread could see
a zero page unexpectedly and theoretically make an incorrect decision
visible outside of the killed task's context. To quote Michael S. Tsirkin:
: Getting an error from __get_user and friends is handled gracefully.
: Getting zero instead of a real value will cause userspace
: memory corruption.

Make sure that each place which can read from userspace is annotated
properly and uses copy_from_user_mm, __get_user_mm or copy_from_iter_mm
respectively. Each takes the target mm as an argument and performs a
pessimistic check to rule out that the oom_reaper could possibly have
unmapped the particular page. __oom_reap_task then just needs to mark
the mm as unstable before it unmaps any page.

This is a preparatory patch without any functional changes because
the oom reaper doesn't touch mm shared with kthreads yet.
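
The check-after-copy pattern these wrappers implement can be sketched in
userspace: do the copy first, then test the "unstable" bit; if the
reaper flagged the mm while we were copying, the data may have come from
a zeroed page and must be discarded as a fault. Types and names below
are illustrative stand-ins for the kernel ones:

```c
#include <errno.h>
#include <stdatomic.h>
#include <string.h>

#define MMF_UNSTABLE 23

struct mm_sketch { atomic_ulong flags; };	/* stand-in for mm_struct */

static long copy_from_user_mm_sketch(struct mm_sketch *mm, void *to,
				     const void *from, unsigned long n)
{
	memcpy(to, from, n);	/* stand-in for __copy_from_user() */

	/* The reaper sets the bit *before* unmapping anything, so seeing
	 * it clear here means no page we just read could have been
	 * reaped; seeing it set means the data cannot be trusted. */
	if (atomic_load(&mm->flags) & (1UL << MMF_UNSTABLE))
		return -EFAULT;
	return 0;
}
```

This ordering is why the kernel side needs no extra barriers: unmapping
implies the necessary synchronization, and a racing reader either faults
on the unmapped page or observes the flag and reports -EFAULT.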

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/vhost/scsi.c    |  2 +-
 drivers/vhost/vhost.c   | 18 +++++++++---------
 include/linux/sched.h   |  1 +
 include/linux/uaccess.h | 22 ++++++++++++++++++++++
 include/linux/uio.h     | 10 ++++++++++
 mm/oom_kill.c           |  8 ++++++++
 6 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 0e6fd556c982..2c8dc0b9a21f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -932,7 +932,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		 */
 		iov_iter_init(&out_iter, WRITE, vq->iov, out, out_size);
 
-		ret = copy_from_iter(req, req_size, &out_iter);
+		ret = copy_from_iter_mm(vq->dev->mm, req, req_size, &out_iter);
 		if (unlikely(ret != req_size)) {
 			vq_err(vq, "Faulted on copy_from_iter\n");
 			vhost_scsi_send_bad_target(vs, vq, head, out);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1e2bb6..71a754a0fe7e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1212,7 +1212,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = __get_user(last_used_idx, &vq->used->idx);
+	r = __get_user_mm(vq->dev->mm, last_used_idx, &vq->used->idx);
 	if (r)
 		goto err;
 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
@@ -1328,7 +1328,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
 			       i, count);
 			return -EINVAL;
 		}
-		if (unlikely(copy_from_iter(&desc, sizeof(desc), &from) !=
+		if (unlikely(copy_from_iter_mm(vq->dev->mm, &desc, sizeof(desc), &from) !=
 			     sizeof(desc))) {
 			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
 			       i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
@@ -1392,7 +1392,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
-	if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+	if (unlikely(__get_user_mm(vq->dev->mm, avail_idx, &vq->avail->idx))) {
 		vq_err(vq, "Failed to access avail idx at %p\n",
 		       &vq->avail->idx);
 		return -EFAULT;
@@ -1414,7 +1414,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(__get_user(ring_head,
+	if (unlikely(__get_user_mm(vq->dev->mm, ring_head,
 				&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
@@ -1450,7 +1450,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
+		ret = __copy_from_user_mm(vq->dev->mm, &desc, vq->desc + i, sizeof desc);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -1622,7 +1622,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
+		if (__get_user_mm(dev->mm, flags, &vq->avail->flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -1636,7 +1636,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (__get_user(event, vhost_used_event(vq))) {
+	if (__get_user_mm(dev->mm, event, vhost_used_event(vq))) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -1678,7 +1678,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
 	if (r)
 		return false;
 
@@ -1713,7 +1713,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index befdcc1cde3c..ff5102adb0c4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
 #define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
+#define MMF_UNSTABLE		23	/* mm is unstable for copy_from_user */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 349557825428..a327d5362581 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
 #endif		/* ARCH_HAS_NOCACHE_UACCESS */
 
 /*
+ * A safe variant of __get_user for use_mm() users to have a
+ * guarantee that the address space wasn't reaped in the background
+ */
+#define __get_user_mm(mm, x, ptr)				\
+({								\
+	int ___gu_err = __get_user(x, ptr);			\
+	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
+		___gu_err = -EFAULT;				\
+	___gu_err;						\
+})
+
+/* similar to __get_user_mm */
+static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
+		void *to, const void __user * from, unsigned long n)
+{
+	long ret = __copy_from_user(to, from, n);
+	if ((ret >= 0) && test_bit(MMF_UNSTABLE, &mm->flags))
+		return -EFAULT;
+	return ret;
+}
+
+/*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
  * @src: address to read from
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1b5d1cd796e2..4be6b24003d8 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -9,6 +9,7 @@
 #ifndef __LINUX_UIO_H
 #define __LINUX_UIO_H
 
+#include <linux/sched.h>
 #include <linux/kernel.h>
 #include <uapi/linux/uio.h>
 
@@ -84,6 +85,15 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+
+static inline size_t copy_from_iter_mm(struct mm_struct *mm, void *addr,
+		size_t bytes, struct iov_iter *i)
+{
+	size_t ret = copy_from_iter(addr, bytes, i);
+	if (!IS_ERR_VALUE(ret) && test_bit(MMF_UNSTABLE, &mm->flags))
+		return -EFAULT;
+	return ret;
+}
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b2210b6c38ba..38a0cd32c01b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -492,6 +492,14 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		goto unlock_oom;
 	}
 
+	/*
+	 * Tell all users of __get_user_mm/copy_from_user_mm that the content
+	 * is no longer stable. No barriers are really needed because the
+	 * unmapping should imply barriers already and the reader would hit a
+	 * page fault if it stumbled over reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
 	ret = true;
 	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 6/6] oom, oom_reaper: allow to reap mm shared by the kthreads
  2016-07-01  9:26 [RFC PATCH 0/6] fortify oom killer even more Michal Hocko
                   ` (4 preceding siblings ...)
  2016-07-01  9:26 ` [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
@ 2016-07-01  9:26 ` Michal Hocko
  5 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2016-07-01  9:26 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The oom reaper was skipped for an mm which is shared with a kernel thread
(aka use_mm()). The primary concern was that such a kthread might want
to read from the userspace memory and see a zero page as a result of the
oom reaper's action. This seems to be overly conservative because none of
the current use_mm() users needs to do copy_from_user or get_user. The aio
code used to rely on copy_from_user but that is long gone along with the
use_mm() usage in fs/aio.c.

We currently have only 3 users in the kernel:
- ffs_user_copy_worker and ep_user_copy_worker only do copy_to_iter()
- vhost_worker needs to copy from userspace but it relies on the safe
  __get_user_mm, copy_from_user_mm and copy_from_iter_mm variants

Add a note to use_mm() about the copy_from_user risk and allow the oom
killer to invoke the oom_reaper for mms shared with kthreads. This makes
practically all sane use cases reapable.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmu_context.c |  6 ++++++
 mm/oom_kill.c    | 14 +++++++-------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index f802c2d216a7..61a7a90250be 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -16,6 +16,12 @@
  *	mm context.
  *	(Note: this routine is intended to be called only
  *	from a kernel thread context)
+ *
+ *	Do not use copy_from_user/__get_user from this context;
+ *	use the safe copy_from_user_mm/__get_user_mm instead,
+ *	because the address space might get reclaimed behind our
+ *	back by the oom_reaper, so an unexpected zero page might
+ *	be encountered.
  */
 void use_mm(struct mm_struct *mm)
 {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 38a0cd32c01b..d5ea082a7360 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -914,13 +914,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used. Hide the mm from the oom
-			 * killer to guarantee OOM forward progress.
-			 */
+		if (is_global_init(p)) {
 			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
@@ -928,6 +922,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 					task_pid_nr(p), p->comm);
 			continue;
 		}
+		/*
+		 * No use_mm() user reads from the userspace except via the
+		 * oom safe helpers so we are ok to reap the mm.
+		 */
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/6] oom: keep mm of the killed task available
  2016-07-01  9:26 ` [RFC PATCH 1/6] oom: keep mm of the killed task available Michal Hocko
@ 2016-07-03  2:45   ` Tetsuo Handa
  2016-07-07  8:24     ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:45 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mhocko

Michal Hocko wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7d0a275df822..4ea4a649822d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	 * Don't allow any other task to have access to the reserves unless
>  	 * the task has MMF_OOM_REAPED because chances that it would release
>  	 * any memory is quite low.
> +	 * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time
> +	 * so let it try again.
>  	 */
>  	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> -		struct task_struct *p = find_lock_task_mm(task);
> +		struct mm_struct *mm = task->signal->oom_mm;
>  		enum oom_scan_t ret = OOM_SCAN_ABORT;
>  
> -		if (p) {
> -			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> -				ret = OOM_SCAN_CONTINUE;
> -			task_unlock(p);
> -		}
> +		if (test_bit(MMF_OOM_REAPED, &mm->flags))
> +			ret = OOM_SCAN_CONTINUE;
> +		else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags))
> +			ret = OOM_SCAN_SELECT;

I don't think this is useful.

MMF_OOM_NOT_REAPABLE is set when mm->mmap_sem could not be held for read
by the OOM reaper thread. That occurs when someone is blocked in an
unkillable wait with that mm->mmap_sem held for write. Unless that someone
is blocked merely for lack of CPU time, it is likely blocked waiting for
somebody else's memory allocation. Then, retrying the OOM reap of an
MMF_OOM_NOT_REAPABLE mm as soon as oom_scan_process_thread() finds it
won't succeed. At the very least, retrying the OOM reap of an
MMF_OOM_NOT_REAPABLE mm should be attempted only after that someone is
no longer blocked waiting for somebody else's memory allocation (e.g.
retry only when oom_scan_process_thread() is sure that the OOM reaper
thread can hold mm->mmap_sem for read).

But I don't think we need to dance with task->signal->oom_mm.
See my series which removes task->signal->oom_victims and the OOM_SCAN_ABORT case.

>  
>  		return ret;
>  	}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-01  9:26 ` [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
@ 2016-07-03 13:47   ` Oleg Nesterov
  2016-07-03 14:09     ` Michael S. Tsirkin
  0 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2016-07-03 13:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, David Rientjes,
	Vladimir Davydov, Michal Hocko, Michael S. Tsirkin

On 07/01, Michal Hocko wrote:
>
> From: Michal Hocko <mhocko@suse.com>
>
> vhost driver relies on copy_from_user/get_user from a kernel thread.
> This makes it impossible to reap the memory of an oom victim which
> shares mm with the vhost kernel thread because it could see a zero
> page unexpectedly and theoretically make an incorrect decision visible
> outside of the killed task context.

And I still can't understand how, but let me repeat that I don't understand
this code at all.

> To quote Michael S. Tsirkin:
> : Getting an error from __get_user and friends is handled gracefully.
> : Getting zero instead of a real value will cause userspace
> : memory corruption.

Which userspace memory corruption? We are going to kill the dev->mm owner,
the task which did ioctl(VHOST_SET_OWNER), and (at first glance) the task
that communicates with the callbacks fired by vhost_worker().

Michael, could you please spell why should we care?

> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -492,6 +492,14 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  		goto unlock_oom;
>  	}
>
> +	/*
> +	 * Tell all users of get_user_mm/copy_from_user_mm that the content
> +	 * is no longer stable. No barriers really needed because unmapping
> +	 * should imply barriers already and the reader would hit a page fault
> +	 * if it stumbled over a reaped memory.
> +	 */
> +	set_bit(MMF_UNSTABLE, &mm->flags);

And this is racy anyway.

Oleg.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 13:47   ` Oleg Nesterov
@ 2016-07-03 14:09     ` Michael S. Tsirkin
  2016-07-03 15:18       ` Oleg Nesterov
  2016-07-07  8:39       ` Michal Hocko
  0 siblings, 2 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2016-07-03 14:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michal Hocko

On Sun, Jul 03, 2016 at 03:47:19PM +0200, Oleg Nesterov wrote:
> On 07/01, Michal Hocko wrote:
> >
> > From: Michal Hocko <mhocko@suse.com>
> >
> > vhost driver relies on copy_from_user/get_user from a kernel thread.
> > This makes it impossible to reap the memory of an oom victim which
> > shares mm with the vhost kernel thread because it could see a zero
> > page unexpectedly and theoretically make an incorrect decision visible
> > outside of the killed task context.
> 
> And I still can't understand how, but let me repeat that I don't understand
> this code at all.
> 
> > To quote Michael S. Tsirkin:
> > : Getting an error from __get_user and friends is handled gracefully.
> > : Getting zero instead of a real value will cause userspace
> > : memory corruption.
> 
> Which userspace memory corruption? We are going to kill the dev->mm owner,
> the task which did ioctl(VHOST_SET_OWNER) and (at first glance) the task
> who communicates with the callbacks fired by vhost_worker().
> 
> Michael, could you please spell why should we care?

I am concerned that
- oom victim is sharing memory with another task
- getting incorrect value from ring read makes vhost
  change that shared memory


Also, I don't see where we kill the task that communicates with the
callbacks.


Having said all that, how about we just add some kind of per-mm
notifier list, and let vhost know that owner is going away so
it should stop looking at memory?

Seems cleaner than looking at flags at each memory access,
since vhost has its own locking.


> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -492,6 +492,14 @@ static bool __oom_reap_task(struct task_struct *tsk)
> >  		goto unlock_oom;
> >  	}
> >
> > +	/*
> > +	 * Tell all users of get_user_mm/copy_from_user_mm that the content
> > +	 * is no longer stable. No barriers really needed because unmapping
> > +	 * should imply barriers already and the reader would hit a page fault
> > +	 * if it stumbled over a reaped memory.
> > +	 */
> > +	set_bit(MMF_UNSTABLE, &mm->flags);
> 
> And this is racy anyway.
> 
> Oleg.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 14:09     ` Michael S. Tsirkin
@ 2016-07-03 15:18       ` Oleg Nesterov
  2016-07-03 15:30         ` Michael S. Tsirkin
  2016-07-07  8:42         ` Michal Hocko
  2016-07-07  8:39       ` Michal Hocko
  1 sibling, 2 replies; 26+ messages in thread
From: Oleg Nesterov @ 2016-07-03 15:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michal Hocko

On 07/03, Michael S. Tsirkin wrote:
>
> On Sun, Jul 03, 2016 at 03:47:19PM +0200, Oleg Nesterov wrote:
> > On 07/01, Michal Hocko wrote:
> > >
> > > From: Michal Hocko <mhocko@suse.com>
> > >
> > > vhost driver relies on copy_from_user/get_user from a kernel thread.
> > > This makes it impossible to reap the memory of an oom victim which
> > > shares mm with the vhost kernel thread because it could see a zero
> > > page unexpectedly and theoretically make an incorrect decision visible
> > > outside of the killed task context.
> >
> > And I still can't understand how, but let me repeat that I don't understand
> > this code at all.
> >
> > > To quote Michael S. Tsirkin:
> > > : Getting an error from __get_user and friends is handled gracefully.
> > > : Getting zero instead of a real value will cause userspace
> > > : memory corruption.
> >
> > Which userspace memory corruption? We are going to kill the dev->mm owner,
> > the task which did ioctl(VHOST_SET_OWNER) and (at first glance) the task
> > who communicates with the callbacks fired by vhost_worker().
> >
> > Michael, could you please spell why should we care?
>
> I am concerned that
> - oom victim is sharing memory with another task
> - getting incorrect value from ring read makes vhost
>   change that shared memory

Well, we are going to kill all tasks which share this memory. I mean, ->mm.
If "sharing memory with another task" means, say, a file, then this memory
won't be unmapped (if shared).

So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we
unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted?

Sorry, I simply do not know what vhost does, quite possibly a stupid question.

> Having said all that, how about we just add some kind of per-mm
> notifier list, and let vhost know that owner is going away so
> it should stop looking at memory?
>
> Seems cleaner than looking at flags at each memory access,
> since vhost has its own locking.

Agreed... although of course I do not understand how this should work. But
looks better in any case..

Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as
well, this should not have any effect unless kthread does allow_signal(SIGKILL),
then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure
this is really possible.

Oleg.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 15:18       ` Oleg Nesterov
@ 2016-07-03 15:30         ` Michael S. Tsirkin
  2016-07-03 16:47           ` Oleg Nesterov
  2016-07-07  8:42         ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2016-07-03 15:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michal Hocko

On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote:
> On 07/03, Michael S. Tsirkin wrote:
> >
> > On Sun, Jul 03, 2016 at 03:47:19PM +0200, Oleg Nesterov wrote:
> > > On 07/01, Michal Hocko wrote:
> > > >
> > > > From: Michal Hocko <mhocko@suse.com>
> > > >
> > > > vhost driver relies on copy_from_user/get_user from a kernel thread.
> > > > This makes it impossible to reap the memory of an oom victim which
> > > > shares mm with the vhost kernel thread because it could see a zero
> > > > page unexpectedly and theoretically make an incorrect decision visible
> > > > outside of the killed task context.
> > >
> > > And I still can't understand how, but let me repeat that I don't understand
> > > this code at all.
> > >
> > > > To quote Michael S. Tsirkin:
> > > > : Getting an error from __get_user and friends is handled gracefully.
> > > > : Getting zero instead of a real value will cause userspace
> > > > : memory corruption.
> > >
> > > Which userspace memory corruption? We are going to kill the dev->mm owner,
> > > the task which did ioctl(VHOST_SET_OWNER) and (at first glance) the task
> > > who communicates with the callbacks fired by vhost_worker().
> > >
> > > Michael, could you please spell why should we care?
> >
> > I am concerned that
> > - oom victim is sharing memory with another task
> > - getting incorrect value from ring read makes vhost
> >   change that shared memory
> 
> Well, we are going to kill all tasks which share this memory. I mean, ->mm.
> If "sharing memory with another task" means, say, a file, then this memory
> won't be unmapped (if shared).
> 
> So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we
> unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted?

As you say, I mean anyone who shares memory with QEMU through a file.
IIUC current users that do this are all stateless so
even if they crash this is not a big deal, but it seems
wrong to assume this will be like this forever.

> Sorry, I simply do not know what vhost does, quite possibly a stupid question.
> 
> > Having said all that, how about we just add some kind of per-mm
> > notifier list, and let vhost know that owner is going away so
> > it should stop looking at memory?
> >
> > Seems cleaner than looking at flags at each memory access,
> > since vhost has its own locking.
> 
> Agreed... although of course I do not understand how this should work.

Add a linked list of callbacks in struct mm_struct. vhost would add itself there.
In the callback, set private_data for all vqs to NULL under the vq mutex.


> But
> looks better in any case..
> 
> Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as
> well, this should not have any effect unless kthread does allow_signal(SIGKILL),
> then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure
> this is really possible.
> 
> Oleg.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 15:30         ` Michael S. Tsirkin
@ 2016-07-03 16:47           ` Oleg Nesterov
  2016-07-03 21:17             ` Michael S. Tsirkin
  0 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2016-07-03 16:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michal Hocko

On 07/03, Michael S. Tsirkin wrote:
>
> On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote:
> >
> > Well, we are going to kill all tasks which share this memory. I mean, ->mm.
> > If "sharing memory with another task" means, say, a file, then this memory
> > won't be unmapped (if shared).
> >
> > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we
> > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted?
>
> As you say, I mean anyone who shares memory with QEMU through a file.

And in this case vhost_worker() reads the anonymous memory of QEMU process,
not the memory which can be shared with another task, correct?

And if QEMU simply crashes, this can't affect anyone who shares memory with
QEMU through a file, yes?

Oleg.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 16:47           ` Oleg Nesterov
@ 2016-07-03 21:17             ` Michael S. Tsirkin
  2016-07-07  8:28               ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2016-07-03 21:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michal Hocko

On Sun, Jul 03, 2016 at 06:47:23PM +0200, Oleg Nesterov wrote:
> On 07/03, Michael S. Tsirkin wrote:
> >
> > On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote:
> > >
> > > Well, we are going to kill all tasks which share this memory. I mean, ->mm.
> > > If "sharing memory with another task" means, say, a file, then this memory
> > > won't be unmapped (if shared).
> > >
> > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we
> > > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted?
> >
> > As you say, I mean anyone who shares memory with QEMU through a file.
> 
> And in this case vhost_worker() reads the anonymous memory of QEMU process,
> not the memory which can be shared with another task, correct?
> 
> And if QEMU simply crashes, this can't affect anyone who shares memory with
> QEMU through a file, yes?
> 
> Oleg.

Well no - the VM memory is not always anonymous memory. It can be an
mmaped file.

-- 
MST

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/6] oom: keep mm of the killed task available
  2016-07-03  2:45   ` Tetsuo Handa
@ 2016-07-07  8:24     ` Michal Hocko
  2016-07-07 11:48       ` Tetsuo Handa
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2016-07-07  8:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

On Sun 03-07-16 11:45:34, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7d0a275df822..4ea4a649822d 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> >  	 * Don't allow any other task to have access to the reserves unless
> >  	 * the task has MMF_OOM_REAPED because chances that it would release
> >  	 * any memory is quite low.
> > +	 * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time
> > +	 * so let it try again.
> >  	 */
> >  	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> > -		struct task_struct *p = find_lock_task_mm(task);
> > +		struct mm_struct *mm = task->signal->oom_mm;
> >  		enum oom_scan_t ret = OOM_SCAN_ABORT;
> >  
> > -		if (p) {
> > -			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> > -				ret = OOM_SCAN_CONTINUE;
> > -			task_unlock(p);
> > -		}
> > +		if (test_bit(MMF_OOM_REAPED, &mm->flags))
> > +			ret = OOM_SCAN_CONTINUE;
> > +		else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags))
> > +			ret = OOM_SCAN_SELECT;
> 
> I don't think this is useful.

Well, to be honest, me neither, but changing the retry logic is not in
the scope of this patch; it just preserves the existing logic. I guess we
can get rid of it but that deserves a separate patch. The retry was
implemented to cover unlikely stalls when the lock is held, but as this
has never been observed in real life I would agree to remove it to
simplify the code (even though it is literally a few lines of code). I was
probably overcautious when adding the flag.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 21:17             ` Michael S. Tsirkin
@ 2016-07-07  8:28               ` Michal Hocko
  2016-07-07 15:38                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2016-07-07  8:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On Mon 04-07-16 00:17:55, Michael S. Tsirkin wrote:
> On Sun, Jul 03, 2016 at 06:47:23PM +0200, Oleg Nesterov wrote:
> > On 07/03, Michael S. Tsirkin wrote:
> > >
> > > On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote:
> > > >
> > > > Well, we are going to kill all tasks which share this memory. I mean, ->mm.
> > > > If "sharing memory with another task" means, say, a file, then this memory
> > > > won't be unmapped (if shared).
> > > >
> > > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we
> > > > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted?
> > >
> > > As you say, I mean anyone who shares memory with QEMU through a file.
> > 
> > And in this case vhost_worker() reads the anonymous memory of QEMU process,
> > not the memory which can be shared with another task, correct?
> > 
> > And if QEMU simply crashes, this can't affect anyone who shares memory with
> > QEMU through a file, yes?
> > 
> > Oleg.
> 
> Well no - the VM memory is not always anonymous memory. It can be an
> mmaped file.

Just to make sure we are all on the same page: I guess the scenario is
as follows. The owner of the mm has the ring and other stateful
information in its private memory, but consumers living with their own mm
consume some data from shared memory segments (e.g. files). The worker
would misinterpret the stateful information (zeros rather than the
original content) and would copy invalid/corrupted data to the consumer.
Am I correct?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 14:09     ` Michael S. Tsirkin
  2016-07-03 15:18       ` Oleg Nesterov
@ 2016-07-07  8:39       ` Michal Hocko
  2016-07-22 11:09         ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2016-07-07  8:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On Sun 03-07-16 17:09:04, Michael S. Tsirkin wrote:
[...]
> Having said all that, how about we just add some kind of per-mm
> notifier list, and let vhost know that owner is going away so
> it should stop looking at memory?

But this would have to be a synchronous operation from the oom killer,
no? I would really like to reduce the number of external dependencies
from the oom killer paths as much as possible; this is the whole point
of these patches. If we have a notification mechanism, what would
guarantee that the oom killer makes forward progress if the notified end
cannot continue (waiting for a lock etc.)?

I do realize that a test on each memory access is not particularly
welcome. An alternative would be to hook the check into the page fault
handler because then the overhead would be reduced to the slowpath only
(from the copy_from_user POV). But then non-use_mm users would also have
to pay the price, which is even less attractive.

Another alternative would be disabling pagefaults when copying from the
userspace. This would require that the memory is prefaulted when used,
which might be a problem for the current implementation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-03 15:18       ` Oleg Nesterov
  2016-07-03 15:30         ` Michael S. Tsirkin
@ 2016-07-07  8:42         ` Michal Hocko
  2016-07-07 16:46           ` Oleg Nesterov
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2016-07-07  8:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michael S. Tsirkin, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On Sun 03-07-16 17:18:29, Oleg Nesterov wrote:
[...]
> Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as
> well, this should not have any effect unless kthread does allow_signal(SIGKILL),
> then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure
> this is really possible.

But then we would have to check for the signal after every memory
access, no? To me this sounds much more error prone than having the test
wrapped inside the copy_from... API.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/6] oom: keep mm of the killed task available
  2016-07-07  8:24     ` Michal Hocko
@ 2016-07-07 11:48       ` Tetsuo Handa
  2016-07-07 13:32         ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Tetsuo Handa @ 2016-07-07 11:48 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

Michal Hocko wrote:
> On Sun 03-07-16 11:45:34, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 7d0a275df822..4ea4a649822d 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> > >  	 * Don't allow any other task to have access to the reserves unless
> > >  	 * the task has MMF_OOM_REAPED because chances that it would release
> > >  	 * any memory is quite low.
> > > +	 * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time
> > > +	 * so let it try again.
> > >  	 */
> > >  	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> > > -		struct task_struct *p = find_lock_task_mm(task);
> > > +		struct mm_struct *mm = task->signal->oom_mm;
> > >  		enum oom_scan_t ret = OOM_SCAN_ABORT;
> > >  
> > > -		if (p) {
> > > -			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> > > -				ret = OOM_SCAN_CONTINUE;
> > > -			task_unlock(p);
> > > -		}
> > > +		if (test_bit(MMF_OOM_REAPED, &mm->flags))
> > > +			ret = OOM_SCAN_CONTINUE;
> > > +		else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags))
> > > +			ret = OOM_SCAN_SELECT;
> > 
> > I don't think this is useful.
> 
> Well, to be honest, me neither, but changing the retry logic is not in
> the scope of this patch. It just preserves the existing logic. I guess we
> can get rid of it, but that deserves a separate patch. The retry was
> implemented to cover unlikely stalls when the lock is held, but as this
> has never been observed in real life I would agree to remove it to
> simplify the code (even though it is literally a few lines of code). I was
> probably overcautious when adding the flag.
> 

You mean reverting http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org ?

If we hit a situation where MMF_OOM_NOT_REAPABLE is set, it means that that mm
was used by multiple threads and one of them is blocked. On the other hand,
since currently task_struct->oom_reaper_list is used, we can hit the
following sequence (say T1, T2 and T3 are sharing the same mm):

  (1) T1's mm is queued to oom_reaper_list for the first time, by T1.
  (2) The OOM reaper finds that mm for the first time.
  (3) The OOM reaper fails to hold mm->mmap_sem for read because T3 is blocked with that mm->mmap_sem held for write.
  (4) T2's mm (which is the same as T1's mm) is queued to oom_reaper_list for the second time, by T2.
  (5) The OOM reaper still fails to hold mm->mmap_sem for read because T3 is blocked with that mm->mmap_sem held for write.
  (6) The OOM reaper sets MMF_OOM_NOT_REAPABLE.
  (7) That mm is dequeued from oom_reaper_list for the first time by the OOM reaper.
  (8) The OOM reaper finds that mm for the second time.
  (9) The OOM reaper still fails to hold mm->mmap_sem for read because T3 is blocked with that mm->mmap_sem held for write.
  (10) The OOM reaper sets MMF_OOM_REAPED.
  (11) That mm is dequeued from oom_reaper_list for the second time by the OOM reaper.

To me, MMF_OOM_NOT_REAPABLE alone is unlikely to be helpful.

If an oom_mm_list which chains mm_structs is used, at least we won't
concurrently queue the same mm which is currently under the OOM reaper's operation.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/6] oom: keep mm of the killed task available
  2016-07-07 11:48       ` Tetsuo Handa
@ 2016-07-07 13:32         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2016-07-07 13:32 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

On Thu 07-07-16 20:48:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 03-07-16 11:45:34, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 7d0a275df822..4ea4a649822d 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> > > >  	 * Don't allow any other task to have access to the reserves unless
> > > >  	 * the task has MMF_OOM_REAPED because chances that it would release
> > > >  	 * any memory is quite low.
> > > > +	 * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time
> > > > +	 * so let it try again.
> > > >  	 */
> > > >  	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> > > > -		struct task_struct *p = find_lock_task_mm(task);
> > > > +		struct mm_struct *mm = task->signal->oom_mm;
> > > >  		enum oom_scan_t ret = OOM_SCAN_ABORT;
> > > >  
> > > > -		if (p) {
> > > > -			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> > > > -				ret = OOM_SCAN_CONTINUE;
> > > > -			task_unlock(p);
> > > > -		}
> > > > +		if (test_bit(MMF_OOM_REAPED, &mm->flags))
> > > > +			ret = OOM_SCAN_CONTINUE;
> > > > +		else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags))
> > > > +			ret = OOM_SCAN_SELECT;
> > > 
> > > I don't think this is useful.
> > 
> > Well, to be honest, me neither, but changing the retry logic is not in
> > the scope of this patch. It just preserves the existing logic. I guess we
> > can get rid of it, but that deserves a separate patch. The retry was
> > implemented to cover unlikely stalls when the lock is held, but as this
> > has never been observed in real life I would agree to remove it to
> > simplify the code (even though it is literally a few lines of code). I was
> > probably overcautious when adding the flag.
> > 
> 
> You mean reverting http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org ?

Yes, assuming that MMF_OOM_REAPED is set in that case, of course.

[Skipping the rest as this is not related to this patch.]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-07  8:28               ` Michal Hocko
@ 2016-07-07 15:38                 ` Michael S. Tsirkin
  2016-07-08 12:29                   ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2016-07-07 15:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote:
> On Mon 04-07-16 00:17:55, Michael S. Tsirkin wrote:
> > On Sun, Jul 03, 2016 at 06:47:23PM +0200, Oleg Nesterov wrote:
> > > On 07/03, Michael S. Tsirkin wrote:
> > > >
> > > > On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote:
> > > > >
> > > > > Well, we are going to kill all tasks which share this memory. I mean, ->mm.
> > > > > If "sharing memory with another task" means, say, a file, then this memory
> > > > > won't be unmapped (if shared).
> > > > >
> > > > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we
> > > > > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted?
> > > >
> > > > As you say, I mean anyone who shares memory with QEMU through a file.
> > > 
> > > And in this case vhost_worker() reads the anonymous memory of QEMU process,
> > > not the memory which can be shared with another task, correct?
> > > 
> > > And if QEMU simply crashes, this can't affect anyone who shares memory with
> > > QEMU through a file, yes?
> > > 
> > > Oleg.
> > 
> > Well no - the VM memory is not always anonymous memory. It can be an
> > mmaped file.
> 
> Just to make sure we are all on the same page. I guess the scenario is
> as follows. The owner of the mm has the ring and other stateful information
> in its private memory, but consumers living with their own mm consume
> some data from shared memory segments (e.g. files). The worker would
> misinterpret the stateful information (zeros rather than the original
> content) and would copy invalid/corrupted data to the consumer. Am I
> correct?
> 
> -- 
> Michal Hocko
> SUSE Labs


Exactly.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-07  8:42         ` Michal Hocko
@ 2016-07-07 16:46           ` Oleg Nesterov
  0 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2016-07-07 16:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On 07/07, Michal Hocko wrote:
>
> On Sun 03-07-16 17:18:29, Oleg Nesterov wrote:
> [...]
> > Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as
> > well, this should not have any effect unless kthread does allow_signal(SIGKILL),
> > then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure
> > this is really possible.
>
> But then we would have to check for the signal after every memory
> access, no? This sounds much more error-prone to me than having the test
> wrapped inside the copy_from... API.

At least I agree this doesn't look nice too.

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-07 15:38                 ` Michael S. Tsirkin
@ 2016-07-08 12:29                   ` Oleg Nesterov
  2016-07-11 14:14                     ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2016-07-08 12:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On 07/07, Michael S. Tsirkin wrote:
>
> On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote:
> >
> > Just to make sure we are all on the same page. I guess the scenario is
> > as follows. The owner of the mm has the ring and other stateful information
> > in its private memory, but consumers living with their own mm consume
> > some data from shared memory segments (e.g. files). The worker would
> > misinterpret the stateful information (zeros rather than the original
> > content) and would copy invalid/corrupted data to the consumer. Am I
> > correct?
>
> Exactly.

Michael, let me ask again.

But what if we simply kill the owner of this mm? Yes, if we don't unmap its
memory then vhost_worker() can't read the wrong zero from an anonymous vma.
But the killed process obviously won't be able to update this stateful info
after that; it will be frozen. Are you saying this can't affect other apps
which share the memory with the (killed) mm owner?

IOW. If we kill a process, this can affect other applications anyway. Just for
example, suppose that this process takes a non-robust futex in the shared memory
segment. After that other users of this futex will hang forever.

So do you think that this particular "vhost" problem is really worse and we must
specially avoid it?

If yes, note that this means that any process which can do VHOST_SET_OWNER becomes
"oom-unkillable" to some degree, and this doesn't look right. It can spawn another
CLONE_FILES process and this will block fops->release() which (iiuc) should stop
the kernel thread which pins the memory hog's memory.

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-08 12:29                   ` Oleg Nesterov
@ 2016-07-11 14:14                     ` Michal Hocko
  2016-07-12 14:33                       ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2016-07-11 14:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michael S. Tsirkin, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On Fri 08-07-16 14:29:48, Oleg Nesterov wrote:
> On 07/07, Michael S. Tsirkin wrote:
> >
> > On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote:
> > >
> > > Just to make sure we are all on the same page. I guess the scenario is
> > > as follows. The owner of the mm has the ring and other stateful information
> > > in its private memory, but consumers living with their own mm consume
> > > some data from shared memory segments (e.g. files). The worker would
> > > misinterpret the stateful information (zeros rather than the original
> > > content) and would copy invalid/corrupted data to the consumer. Am I
> > > correct?
> >
> > Exactly.
> 
> Michael, let me ask again.
> 
> But what if we simply kill the owner of this mm?

I might be wrong here but the mm owner doesn't really matter AFAIU. It
is the holder of the file descriptor for the "device" who controls all
the actions, no? The fact that it hijacked the mm along the way is hidden
from users. If you kill the owner but pass the fd somewhere else then
the mm will live as long as the fd.

[...]

> If yes, note that this means that any process which can do VHOST_SET_OWNER becomes
> "oom-unkillable" to some degree, and this doesn't look right. It can spawn another
> CLONE_FILES process and this will block fops->release() which (iiuc) should stop
> the kernel thread which pins the memory hog's memory.

I believe this is indeed possible. It can even pass the fd to a
different process and keep it alive, hidden from the oom killer causing
other processes to be killed.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-11 14:14                     ` Michal Hocko
@ 2016-07-12 14:33                       ` Oleg Nesterov
  0 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2016-07-12 14:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On 07/11, Michal Hocko wrote:
>
> On Fri 08-07-16 14:29:48, Oleg Nesterov wrote:
> > On 07/07, Michael S. Tsirkin wrote:
> > >
> > > On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote:
> > > >
> > > > Just to make sure we are all on the same page. I guess the scenario is
> > > > as follows. The owner of the mm has the ring and other stateful information
> > > > in its private memory, but consumers living with their own mm consume
> > > > some data from shared memory segments (e.g. files). The worker would
> > > > misinterpret the stateful information (zeros rather than the original
> > > > content) and would copy invalid/corrupted data to the consumer. Am I
> > > > correct?
> > >
> > > Exactly.
> >
> > Michael, let me ask again.
> >
> > But what if we simply kill the owner of this mm?
>
> I might be wrong here but the mm owner doesn't really matter AFAIU. It
> is the holder of the file descriptor for the "device" who controls all
> the actions, no? The fact that it hijacked the mm along the way is hidden
> from users. If you kill the owner but pass the fd somewhere else then
> the mm will live as long as the fd.

Of course. I meant that qemu/guest won't update that stateful info in its
anonymous memory after we kill it. And I have no idea if that is fine or not.

As I said, I do not even know what drivers/vhost actually does, and probably
that is why I do not understand why this particular problem (bogus zeroes in
anonymous memory) is worse than other problems we can't avoid anyway when we
kill the victim and this affects other applications.

> > If yes, note that this means that any process which can do VHOST_SET_OWNER becomes
> > "oom-unkillable" to some degree, and this doesn't look right. It can spawn another
> > CLONE_FILES process and this will block fops->release() which (iiuc) should stop
> > the kernel thread which pins the memory hog's memory.
>
> I believe this is indeed possible. It can even pass the fd to a
> different process and keep it alive, hidden from the oom killer causing
> other processes to be killed.

Yes, so I think we should unmap the memory even if it is used by kthread.

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-07  8:39       ` Michal Hocko
@ 2016-07-22 11:09         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2016-07-22 11:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov

On Thu 07-07-16 10:39:32, Michal Hocko wrote:
> On Sun 03-07-16 17:09:04, Michael S. Tsirkin wrote:
> [...]
> > Having said all that, how about we just add some kind of per-mm
> > notifier list, and let vhost know that owner is going away so
> > it should stop looking at memory?
> 
> But this would have to be a synchronous operation from the oom killer,
> no? I would really like to reduce the number of external dependencies
> from the oom killer paths as much as possible. This is the whole point
> of these patches. If we have a notification mechanism, what would
> guarantee that the oom killer would make forward progress if the
> notified end cannot continue (wait for a lock etc...)?
> 
> I do realize that a test per each memory access is not welcome that
> much. An alternative would be to hook the check into the page fault
> handler because then the overhead would be reduced only to the slowpath
> (from the copy_from_user POV). But then also non use_mm users would have
> to pay the price which is even less attractive.
> 
> Another alternative would be disabling pagefaults when copying from the
> userspace. This would require that the memory is prefaulted when used,
> which might be a problem for the current implementation.

ping Michael... I would like to pursue this again and have something for
4.9 ideally.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-07-22 11:09 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-01  9:26 [RFC PATCH 0/6] fortify oom killer even more Michal Hocko
2016-07-01  9:26 ` [RFC PATCH 1/6] oom: keep mm of the killed task available Michal Hocko
2016-07-03  2:45   ` Tetsuo Handa
2016-07-07  8:24     ` Michal Hocko
2016-07-07 11:48       ` Tetsuo Handa
2016-07-07 13:32         ` Michal Hocko
2016-07-01  9:26 ` [RFC PATCH 2/6] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
2016-07-01  9:26 ` [RFC PATCH 3/6] exit, oom: postpone exit_oom_victim to later Michal Hocko
2016-07-01  9:26 ` [RFC PATCH 4/6] oom, oom_reaper: consider mmget_not_zero as a failure Michal Hocko
2016-07-01  9:26 ` [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
2016-07-03 13:47   ` Oleg Nesterov
2016-07-03 14:09     ` Michael S. Tsirkin
2016-07-03 15:18       ` Oleg Nesterov
2016-07-03 15:30         ` Michael S. Tsirkin
2016-07-03 16:47           ` Oleg Nesterov
2016-07-03 21:17             ` Michael S. Tsirkin
2016-07-07  8:28               ` Michal Hocko
2016-07-07 15:38                 ` Michael S. Tsirkin
2016-07-08 12:29                   ` Oleg Nesterov
2016-07-11 14:14                     ` Michal Hocko
2016-07-12 14:33                       ` Oleg Nesterov
2016-07-07  8:42         ` Michal Hocko
2016-07-07 16:46           ` Oleg Nesterov
2016-07-07  8:39       ` Michal Hocko
2016-07-22 11:09         ` Michal Hocko
2016-07-01  9:26 ` [RFC PATCH 6/6] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
