* [RFC PATCH 0/10] fortify oom killer even more
@ 2016-07-28 19:42 Michal Hocko
  2016-07-28 19:42 ` [PATCH 01/10] mm,oom_reaper: Reduce find_lock_task_mm() usage Michal Hocko
                   ` (9 more replies)
  0 siblings, 10 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Marcelo Tosatti, Michal Hocko, Mikulas Patocka

Hi,
I previously sent this pile as an RFC [1]. After some testing and a
discussion of an alternative approach [2], it seems this series will plug
the remaining holes which could lead to oom lockups for CONFIG_MMU and
open the door to further improvements and cleanups.

I have added a few more patches since the last time. Patches 1 and 2 can be
considered cleanups and I've taken them from Tetsuo's patchset [2].

Patch 3 is the core part of this series. It makes the mm of the oom victim
persistent in the signal struct so that the oom killer can rely purely on this
mm rather than on find_lock_task_mm, which might not find any mm if all
threads have passed exit_mm. Patch 4 is a follow-up fix noticed during
testing. I could have folded it into patch 3 but I guess both will be easier
to review if they are separate.

Patch 5 is a cleanup and it removes signal_struct::oom_victims which is no
longer needed.

Patch 6 makes oom_killer_disable a full quiescent state barrier again.

Patch 7 is a pure cleanup. Again taken from Tetsuo's series [2].

Patch 8 moves exit_oom_victim out of exit_mm to the end of do_exit.

Patch 9 converts the vhost get_user/copy_from_user usage to an oom reaper
safe variant. There were some proposals for a different API -
e.g. a notification mechanism or processing SIGKILL from the kernel
thread - but none of them was either lockless or simpler, and I really do
not want to add further lock dependencies between the oom killer and other
kernel subsystems. I didn't get to test this part because I do not have
much idea how. I would really appreciate help from Michael here.

Patch 10 then allows reaping the oom victim's memory even when it is shared
with a kthread via use_mm, as the only problematic user is safe after the
previous patch. This leaves only one non-reapable case: when the global
init shares the mm with a different process (other than via vfork), which I
would consider exotic and slightly idiotic, so I wouldn't lose sleep over
it.

After this series we should have guaranteed forward progress for the oom
killer invocation on mmu arches, AFAICS. It would be great if this could
make it into 4.9. I would like to build on top of this and clean up the
code even more. In the next step I would like to get rid of TIF_MEMDIE
and make access to memory reserves completely independent of the rest of
the OOM killer logic.

I have run this through the hammering tests, mostly coming from Tetsuo,
and apart from the lockup fixed by patch 4 nothing popped out.

The series is based on top of the mmotm tree (2016-07-22-15-51). Feedback
is more than welcome.

Thanks!

[1] http://lkml.kernel.org/r/1467365190-24640-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/201607031135.AAH95347.MVOHQtFJFLOOFS@I-love.SAKURA.ne.jp

Michal Hocko (7):
      oom: keep mm of the killed task available
      mm, oom: get rid of signal_struct::oom_victims
      kernel, oom: fix potential pgd_lock deadlock from __mmdrop
      oom, suspend: fix oom_killer_disable vs. pm suspend properly
      exit, oom: postpone exit_oom_victim to later
      vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
      oom, oom_reaper: allow to reap mm shared by the kthreads

Tetsuo Handa (3):
      mm,oom_reaper: Reduce find_lock_task_mm() usage.
      mm,oom_reaper: Do not attempt to reap a task twice.
      mm, oom: enforce exit_oom_victim on current task


 drivers/vhost/scsi.c     |   2 +-
 drivers/vhost/vhost.c    |  18 +++---
 include/linux/mm_types.h |   2 -
 include/linux/oom.h      |  10 ++-
 include/linux/sched.h    |  21 ++++++-
 include/linux/uaccess.h  |  22 +++++++
 include/linux/uio.h      |  10 +++
 kernel/exit.c            |   5 +-
 kernel/fork.c            |   7 +++
 kernel/power/process.c   |  17 +----
 mm/mmu_context.c         |   6 ++
 mm/oom_kill.c            | 161 +++++++++++++++++++++--------------------------
 12 files changed, 158 insertions(+), 123 deletions(-)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 01/10] mm,oom_reaper: Reduce find_lock_task_mm() usage.
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 19:42 ` [PATCH 02/10] mm,oom_reaper: Do not attempt to reap a task twice Michal Hocko
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

__oom_reap_task() can be simplified a bit if it receives a valid mm from
oom_reap_task(), which also uses that mm when __oom_reap_task() fails.
We can drop one find_lock_task_mm() call and also make the
__oom_reap_task() code flow easier to follow. Moreover, this will make
a later patch in the series easier to review. Pinning the mm's mm_count
for a longer time is not really harmful because it will not pin much memory.

This patch doesn't introduce any functional change.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 81 ++++++++++++++++++++++++++++-------------------------------
 1 file changed, 38 insertions(+), 43 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275df822..f685341bdee2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -452,12 +452,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
@@ -465,7 +463,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
-	 * __oom_reap_task		exit_mm
+	 * __oom_reap_task_mm		exit_mm
 	 *   mmget_not_zero
 	 *				  mmput
 	 *				    atomic_dec_and_test
@@ -478,22 +476,9 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	mutex_lock(&oom_lock);
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	/*
@@ -503,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -551,8 +536,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
@@ -562,36 +545,45 @@ static bool __oom_reap_task(struct task_struct *tsk)
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
+	struct mm_struct *mm = NULL;
+	struct task_struct *p = find_lock_task_mm(tsk);
+
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might have race with exit path so consider our work done if there
+	 * is no mm.
+	 */
+	if (!p)
+		goto done;
+	mm = p->mm;
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+	if (attempts <= MAX_OOM_REAP_RETRIES)
+		goto done;
 
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
+	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+		task_pid_nr(tsk), tsk->comm);
 
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
-		}
-
-		debug_show_all_locks();
+	/*
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
+	 */
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		set_bit(MMF_OOM_REAPED, &mm->flags);
 	}
+	debug_show_all_locks();
 
+done:
 	/*
 	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
@@ -603,6 +595,9 @@ static void oom_reap_task(struct task_struct *tsk)
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
+	/* Drop a reference taken above. */
+	if (mm)
+		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
-- 
2.8.1


* [PATCH 02/10] mm,oom_reaper: Do not attempt to reap a task twice.
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
  2016-07-28 19:42 ` [PATCH 01/10] mm,oom_reaper: Reduce find_lock_task_mm() usage Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 19:42 ` [PATCH 03/10] oom: keep mm of the killed task available Michal Hocko
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

"mm, oom_reaper: do not attempt to reap a task twice" tried to give the
OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But
the usefulness of the flag is rather limited and actually never shown
in practice. If the flag is set, it means that the holder of mm->mmap_sem
cannot call up_write() due to presumably being blocked at unkillable wait
waiting for other thread's memory allocation. But since one of threads
sharing that mm will queue that mm immediately via task_will_free_mem()
shortcut (otherwise, oom_badness() will select the same mm again due to
oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is
unlikely helpful.

Let's always set MMF_OOM_REAPED.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 -
 mm/oom_kill.c         | 15 +++------------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3568dfca12bb..d9585771dc8b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,7 +512,6 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
-#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f685341bdee2..a8c06883a142 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -567,20 +567,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
+	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
+	set_bit(MMF_OOM_REAPED, &mm->flags);
+
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
-
-	/*
-	 * If we've already tried to reap this task in the past and
-	 * failed it probably doesn't make much sense to try yet again
-	 * so hide the mm from the oom killer so that it can move on
-	 * to another task with a different mm struct.
-	 */
-	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
-		pr_info("oom_reaper: giving up pid:%d (%s)\n",
-			task_pid_nr(tsk), tsk->comm);
-		set_bit(MMF_OOM_REAPED, &mm->flags);
-	}
 	debug_show_all_locks();
 
 done:
-- 
2.8.1


* [PATCH 03/10] oom: keep mm of the killed task available
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
  2016-07-28 19:42 ` [PATCH 01/10] mm,oom_reaper: Reduce find_lock_task_mm() usage Michal Hocko
  2016-07-28 19:42 ` [PATCH 02/10] mm,oom_reaper: Do not attempt to reap a task twice Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 19:42 ` [PATCH 04/10] mm, oom: get rid of signal_struct::oom_victims Michal Hocko
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reap_task has to call exit_oom_victim in order to make sure that the
oom victim will not block the oom killer forever. This, however, opens
new problems (e.g. oom_killer_disable exclusion - see 74070542099c
("oom, suspend: fix oom_reaper vs. oom_killer_disable race")). Ideally,
exit_oom_victim should only be called from the victim's context.

One way to achieve this would be to rely on per mm_struct flags. We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:
do_exit
  exit_mm
    tsk->mm = NULL;
    mmput
      __mmput
    exit_oom_victim

doesn't guarantee that exit_oom_victim will be called in a bounded
amount of time. At least exit_aio depends on IO, which might get blocked
due to lack of memory, and who knows what else is lurking there.

This patch takes a different approach. We remember tsk->mm in the
signal_struct and bind it to the signal struct's lifetime for all oom
victims. __oom_reap_task_mm as well as oom_scan_process_thread no longer
have to rely on find_lock_task_mm and will have a reliable
reference to the mm struct. As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.
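
The publish-once protocol described above (the first thread marked a victim
installs its mm into the shared signal struct with a cmpxchg and takes a
reference, which is dropped only when the signal struct itself dies) can be
sketched as a userspace analogue. All names below are illustrative
stand-ins, not the kernel's types or API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative stand-ins for mm_struct and signal_struct. */
struct mm { atomic_int mm_count; };
struct sig { struct mm *_Atomic oom_mm; };

/* Analogue of mark_oom_victim(): only the first caller in the thread
 * group installs the pointer and pins the mm; later callers see a
 * non-NULL oom_mm, the compare-and-swap fails, and nothing happens. */
static void mark_victim(struct sig *sig, struct mm *mm)
{
	struct mm *expected = NULL;

	if (atomic_compare_exchange_strong(&sig->oom_mm, &expected, mm))
		atomic_fetch_add(&mm->mm_count, 1);	/* like atomic_inc(&mm->mm_count) */
}

/* Analogue of free_signal_struct(): the pin is released exactly once,
 * when the whole thread group is gone. */
static void free_sig(struct sig *sig)
{
	struct mm *mm = atomic_load(&sig->oom_mm);

	if (mm)
		atomic_fetch_sub(&mm->mm_count, 1);	/* like mmdrop(mm) */
}
```

Because publish-and-pin is a single compare-and-swap, concurrent callers
sharing the signal struct cannot double-pin the mm, which is why a single
mmdrop at signal-struct teardown is enough to balance it.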

Enlarging the signal_struct for something as unlikely as the oom
killer is far from ideal, but this approach makes the code much more
reasonable, and long term we might even want to move task->mm into the
signal_struct anyway. In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent,
which would also be nice.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  2 ++
 kernel/fork.c         |  2 ++
 mm/oom_kill.c         | 51 +++++++++++++++++++++------------------------------
 3 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d9585771dc8b..8943546d52e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -792,6 +792,8 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
+					 * killed by the oom killer */
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 191844a157df..7e9f83d5fe95 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -243,6 +243,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
+	if (sig->oom_mm)
+		mmdrop(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a8c06883a142..7f09608405b7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,14 +288,11 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
+		struct mm_struct *mm = task->signal->oom_mm;
 		enum oom_scan_t ret = OOM_SCAN_ABORT;
 
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
+		if (test_bit(MMF_OOM_REAPED, &mm->flags))
+			ret = OOM_SCAN_CONTINUE;
 
 		return ret;
 	}
@@ -526,11 +523,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * This task can be safely ignored because we cannot do much more
-	 * to release its memory.
-	 */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
-	/*
 	 * Drop our reference but make sure the mmput slow path is called from a
 	 * different context because we shouldn't risk we get stuck there and
 	 * put the oom_reaper out of the way.
@@ -545,20 +537,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p = find_lock_task_mm(tsk);
-
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	if (!p)
-		goto done;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
@@ -567,8 +546,6 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
-	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
@@ -584,11 +561,14 @@ static void oom_reap_task(struct task_struct *tsk)
 	tsk->oom_reaper_list = NULL;
 	exit_oom_victim(tsk);
 
+	/*
+	 * Hide this mm from OOM killer because it has been either reaped or
+	 * somebody can't call up_write(mmap_sem).
+	 */
+	set_bit(MMF_OOM_REAPED, &mm->flags);
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-	/* Drop a reference taken above. */
-	if (mm)
-		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
@@ -650,14 +630,25 @@ subsys_initcall(oom_init)
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
+ *
+ * tsk->mm has to be non NULL and caller has to guarantee it is stable (either
+ * under task_lock or operate on the current).
  */
 void mark_oom_victim(struct task_struct *tsk)
 {
+	struct mm_struct *mm = tsk->mm;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
 	atomic_inc(&tsk->signal->oom_victims);
+
+	/* oom_mm is bound to the signal struct life time. */
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+		atomic_inc(&tsk->signal->oom_mm->mm_count);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
-- 
2.8.1


* [PATCH 04/10] mm, oom: get rid of signal_struct::oom_victims
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
                   ` (2 preceding siblings ...)
  2016-07-28 19:42 ` [PATCH 03/10] oom: keep mm of the killed task available Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 19:42 ` [PATCH 05/10] kernel, oom: fix potential pgd_lock deadlock from __mmdrop Michal Hocko
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

After "oom: keep mm of the killed task available" we can safely
detect an oom victim by checking task->signal->oom_mm so we do not need
the signal_struct counter anymore so let's get rid of it.

This alone wouldn't be sufficient for nommu archs because exit_oom_victim
doesn't hide the process from the oom killer anymore. We can, however,
mark the mm with a MMF flag in __mmput. We can reuse MMF_OOM_REAPED and
rename it to a more generic MMF_OOM_SKIP.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h   |  6 ++++++
 include/linux/sched.h |  3 +--
 kernel/fork.c         |  1 +
 mm/oom_kill.c         | 17 +++++++----------
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457ee3a8..bbe0a7789636 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,6 +72,12 @@ static inline bool oom_task_origin(const struct task_struct *p)
 
 extern void mark_oom_victim(struct task_struct *tsk);
 
+static inline bool tsk_is_oom_victim(struct task_struct * tsk)
+{
+	return tsk->signal->oom_mm;
+}
+
+
 #ifdef CONFIG_MMU
 extern void wake_oom_reaper(struct task_struct *tsk);
 #else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8943546d52e7..e3376215f4d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -511,7 +511,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
-#define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
@@ -659,7 +659,6 @@ struct signal_struct {
 	atomic_t		sigcnt;
 	atomic_t		live;
 	int			nr_threads;
-	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
 	struct list_head	thread_head;
 
 	wait_queue_head_t	wait_chldexit;	/* for wait4() */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7e9f83d5fe95..89905b641a0a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -721,6 +721,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	mmdrop(mm);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7f09608405b7..bb4c2ee9c67f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -181,7 +181,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	adj = (long)p->signal->oom_score_adj;
 	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
 		return 0;
@@ -284,14 +284,14 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_REAPED because chances that it would release
+	 * the task has MMF_OOM_SKIP because chances that it would release
 	 * any memory is quite low.
 	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
+	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
 		struct mm_struct *mm = task->signal->oom_mm;
 		enum oom_scan_t ret = OOM_SCAN_ABORT;
 
-		if (test_bit(MMF_OOM_REAPED, &mm->flags))
+		if (test_bit(MMF_OOM_SKIP, &mm->flags))
 			ret = OOM_SCAN_CONTINUE;
 
 		return ret;
@@ -565,7 +565,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -643,8 +643,6 @@ void mark_oom_victim(struct task_struct *tsk)
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 
-	atomic_inc(&tsk->signal->oom_victims);
-
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		atomic_inc(&tsk->signal->oom_mm->mm_count);
@@ -666,7 +664,6 @@ void exit_oom_victim(struct task_struct *tsk)
 {
 	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_dec(&tsk->signal->oom_victims);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -758,7 +755,7 @@ bool task_will_free_mem(struct task_struct *task)
 	 * This task has already been drained by the oom reaper so there are
 	 * only small chances it will free some more
 	 */
-	if (test_bit(MMF_OOM_REAPED, &mm->flags))
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
 		return false;
 
 	if (atomic_read(&mm->mm_users) <= 1)
@@ -898,7 +895,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * killer to guarantee OOM forward progress.
 			 */
 			can_oom_reap = false;
-			set_bit(MMF_OOM_REAPED, &mm->flags);
+			set_bit(MMF_OOM_SKIP, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
-- 
2.8.1


* [PATCH 05/10] kernel, oom: fix potential pgd_lock deadlock from __mmdrop
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
                   ` (3 preceding siblings ...)
  2016-07-28 19:42 ` [PATCH 04/10] mm, oom: get rid of signal_struct::oom_victims Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 19:42 ` [PATCH 06/10] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Lockdep complains that __mmdrop is not safe from the softirq context:

[   63.860469] =================================
[   63.861326] [ INFO: inconsistent lock state ]
[   63.862677] 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
[   63.864072] ---------------------------------
[   63.864072] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   63.864072] swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[   63.864072]  (pgd_lock){+.?...}, at: [<ffffffff81048762>] pgd_free+0x19/0x6b
[   63.864072] {SOFTIRQ-ON-W} state was registered at:
[   63.864072]   [<ffffffff81097da2>] __lock_acquire+0xa06/0x196e
[   63.864072]   [<ffffffff810994d8>] lock_acquire+0x139/0x1e1
[   63.864072]   [<ffffffff81625cd2>] _raw_spin_lock+0x32/0x41
[   63.864072]   [<ffffffff8104594d>] __change_page_attr_set_clr+0x2a5/0xacd
[   63.864072]   [<ffffffff810462e4>] change_page_attr_set_clr+0x16f/0x32c
[   63.864072]   [<ffffffff81046544>] set_memory_nx+0x37/0x3a
[   63.864072]   [<ffffffff81041b2c>] free_init_pages+0x9e/0xc7
[   63.864072]   [<ffffffff81d49105>] alternative_instructions+0xa2/0xb3
[   63.864072]   [<ffffffff81d4a763>] check_bugs+0xe/0x2d
[   63.864072]   [<ffffffff81d3eed0>] start_kernel+0x3ce/0x3ea
[   63.864072]   [<ffffffff81d3e2f1>] x86_64_start_reservations+0x2a/0x2c
[   63.864072]   [<ffffffff81d3e46d>] x86_64_start_kernel+0x17a/0x18d
[   63.864072] irq event stamp: 105916
[   63.864072] hardirqs last  enabled at (105916): [<ffffffff8112f5ba>] free_hot_cold_page+0x37e/0x390
[   63.864072] hardirqs last disabled at (105915): [<ffffffff8112f4fd>] free_hot_cold_page+0x2c1/0x390
[   63.864072] softirqs last  enabled at (105878): [<ffffffff81055724>] _local_bh_enable+0x42/0x44
[   63.864072] softirqs last disabled at (105879): [<ffffffff81055a6d>] irq_exit+0x6f/0xd1
[   63.864072]
[   63.864072] other info that might help us debug this:
[   63.864072]  Possible unsafe locking scenario:
[   63.864072]
[   63.864072]        CPU0
[   63.864072]        ----
[   63.864072]   lock(pgd_lock);
[   63.864072]   <Interrupt>
[   63.864072]     lock(pgd_lock);
[   63.864072]
[   63.864072]  *** DEADLOCK ***
[   63.864072]
[   63.864072] 1 lock held by swapper/1/0:
[   63.864072]  #0:  (rcu_callback){......}, at: [<ffffffff810b44f2>] rcu_process_callbacks+0x390/0x800
[   63.864072]
[   63.864072] stack backtrace:
[   63.864072] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
[   63.864072] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[   63.864072]  0000000000000000 ffff88000fb03c38 ffffffff81312df8 ffffffff8257a0d0
[   63.864072]  ffff8800069f8000 ffff88000fb03c70 ffffffff81125bc5 0000000000000004
[   63.864072]  ffff8800069f8888 ffff8800069f8000 ffffffff8109603a 0000000000000004
[   63.864072] Call Trace:
[   63.864072]  <IRQ>  [<ffffffff81312df8>] dump_stack+0x67/0x90
[   63.864072]  [<ffffffff81125bc5>] print_usage_bug.part.25+0x259/0x268
[   63.864072]  [<ffffffff8109603a>] ? print_shortest_lock_dependencies+0x180/0x180
[   63.864072]  [<ffffffff81096d33>] mark_lock+0x381/0x567
[   63.864072]  [<ffffffff81097d2f>] __lock_acquire+0x993/0x196e
[   63.864072]  [<ffffffff81048762>] ? pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff8117b8ae>] ? discard_slab+0x42/0x44
[   63.864072]  [<ffffffff8117e00d>] ? __slab_free+0x3e6/0x429
[   63.864072]  [<ffffffff810994d8>] lock_acquire+0x139/0x1e1
[   63.864072]  [<ffffffff810994d8>] ? lock_acquire+0x139/0x1e1
[   63.864072]  [<ffffffff81048762>] ? pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff81625cd2>] _raw_spin_lock+0x32/0x41
[   63.864072]  [<ffffffff81048762>] ? pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff81048762>] pgd_free+0x19/0x6b
[   63.864072]  [<ffffffff8104d018>] __mmdrop+0x25/0xb9
[   63.864072]  [<ffffffff8104d29d>] __put_task_struct+0x103/0x11e
[   63.864072]  [<ffffffff810526a0>] delayed_put_task_struct+0x157/0x15e
[   63.864072]  [<ffffffff810b47c2>] rcu_process_callbacks+0x660/0x800
[   63.864072]  [<ffffffff81052549>] ? will_become_orphaned_pgrp+0xae/0xae
[   63.864072]  [<ffffffff8162921c>] __do_softirq+0x1ec/0x4d5
[   63.864072]  [<ffffffff81055a6d>] irq_exit+0x6f/0xd1
[   63.864072]  [<ffffffff81628d7b>] smp_apic_timer_interrupt+0x42/0x4d
[   63.864072]  [<ffffffff8162732e>] apic_timer_interrupt+0x8e/0xa0
[   63.864072]  <EOI>  [<ffffffff81021657>] ? default_idle+0x6b/0x16e
[   63.864072]  [<ffffffff81021ed2>] arch_cpu_idle+0xf/0x11
[   63.864072]  [<ffffffff8108e59b>] default_idle_call+0x32/0x34
[   63.864072]  [<ffffffff8108e7a9>] cpu_startup_entry+0x20c/0x399
[   63.864072]  [<ffffffff81034600>] start_secondary+0xfe/0x101

Moreover, a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was explicit
that pgd_lock must not be taken from irq context. This means that
__mmdrop called from free_signal_struct has to be postponed to a user
context. We already have a similar mechanism for mmput_async so we
can use it here as well. This is safe because mm_count is pinned by
mm_users.

This fixes a bug introduced by "oom: keep mm of the killed task available"

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm_types.h |  2 --
 include/linux/sched.h    | 14 ++++++++++++++
 kernel/fork.c            |  6 +++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ca2ed9a6c8d8..6605b66eebb5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -515,9 +515,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
-#ifdef CONFIG_MMU
 	struct work_struct async_put_work;
-#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e3376215f4d0..127c7f9a7719 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2649,6 +2649,20 @@ static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
 
+static inline void mmdrop_async_fn(struct work_struct *work)
+{
+	struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
+	__mmdrop(mm);
+}
+
+static inline void mmdrop_async(struct mm_struct *mm)
+{
+	if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
+		INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
+		schedule_work(&mm->async_put_work);
+	}
+}
+
 static inline bool mmget_not_zero(struct mm_struct *mm)
 {
 	return atomic_inc_not_zero(&mm->mm_users);
diff --git a/kernel/fork.c b/kernel/fork.c
index 89905b641a0a..fe8cf4e6fa93 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -243,8 +243,12 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
+	/*
+	 * __mmdrop is not safe to call from softirq context on x86 due to
+	 * pgd_dtor so postpone it to the async context
+	 */
 	if (sig->oom_mm)
-		mmdrop(sig->oom_mm);
+		mmdrop_async(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 06/10] oom, suspend: fix oom_killer_disable vs. pm suspend properly
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
                   ` (4 preceding siblings ...)
  2016-07-28 19:42 ` [PATCH 05/10] kernel, oom: fix potential pgd_lock deadlock from __mmdrop Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 19:42 ` [PATCH 07/10] mm, oom: enforce exit_oom_victim on current task Michal Hocko
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
race") worked around an existing race between oom_killer_disable
and oom_reaper by adding another round of try_to_freeze_tasks after
the oom killer was disabled. This was the easiest thing to do for
a late 4.7 fix. Let's fix it properly now.

After "oom: keep mm of the killed task available" we no longer have
to call exit_oom_victim from the oom reaper because we have a stable
mm available and hide the oom_reaped mm with the MMF_OOM_SKIP flag. So
let's remove exit_oom_victim from the oom reaper; the race described in
the above commit doesn't exist anymore.

Unfortunately this alone is not sufficient for the oom_killer_disable
usecase because now we do not have any reliable way to reach
exit_oom_victim (the victim might get stuck on its way to exit for an
unbounded amount of time). The OOM killer can cope with that by checking
the mm flags and moving on to another victim, but we cannot do the same
for oom_killer_disable because we would lose the guarantee of no further
interference of the victim with the rest of the system. What we can do
instead is to cap the maximum time oom_killer_disable waits for
victims. The only current user of this function (pm suspend) already has
a timeout concept for backing off so we can reuse the same value there.

Let's also drop set_freezable for the oom_reaper kthread because it is
no longer needed; the reaper doesn't wake or thaw any processes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h    |  2 +-
 kernel/power/process.c | 17 +++--------------
 mm/oom_kill.c          | 40 ++++++++++++++++++++--------------------
 3 files changed, 24 insertions(+), 35 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index bbe0a7789636..14a0f15c0c59 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -108,7 +108,7 @@ extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
 extern bool oom_killer_disabled;
-extern bool oom_killer_disable(void);
+extern bool oom_killer_disable(signed long timeout);
 extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 0c2ee9761d57..2456f10c7326 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -141,23 +141,12 @@ int freeze_processes(void)
 	/*
 	 * Now that the whole userspace is frozen we need to disbale
 	 * the OOM killer to disallow any further interference with
-	 * killable tasks.
+	 * killable tasks. There is no guarantee oom victims will
+	 * ever reach a point where they go away so we have to wait with a timeout.
 	 */
-	if (!error && !oom_killer_disable())
+	if (!error && !oom_killer_disable(msecs_to_jiffies(freeze_timeout_msecs)))
 		error = -EBUSY;
 
-	/*
-	 * There is a hard to fix race between oom_reaper kernel thread
-	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
-	 * before the victim reaches exit_mm so try to freeze all the tasks
-	 * again and catch such a left over task.
-	 */
-	if (!error) {
-		pr_info("Double checking all user space processes after OOM killer disable... ");
-		error = try_to_freeze_tasks(true);
-		pr_cont("\n");
-	}
-
 	if (error)
 		thaw_processes();
 	return error;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bb4c2ee9c67f..300957e59246 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -552,14 +552,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	debug_show_all_locks();
 
 done:
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
 
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
@@ -573,8 +566,6 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
-	set_freezable();
-
 	while (true) {
 		struct task_struct *tsk = NULL;
 
@@ -670,10 +661,20 @@ void exit_oom_victim(struct task_struct *tsk)
 }
 
 /**
+ * oom_killer_enable - enable OOM killer
+ */
+void oom_killer_enable(void)
+{
+	oom_killer_disabled = false;
+}
+
+/**
  * oom_killer_disable - disable OOM killer
+ * @timeout: maximum timeout to wait for oom victims in jiffies
  *
  * Forces all page allocations to fail rather than trigger OOM killer.
- * Will block and wait until all OOM victims are killed.
+ * Will block and wait until all OOM victims are killed or the given
+ * timeout expires.
  *
  * The function cannot be called when there are runnable user tasks because
  * the userspace would see unexpected allocation failures as a result. Any
@@ -682,8 +683,10 @@ void exit_oom_victim(struct task_struct *tsk)
  * Returns true if successful and false if the OOM killer cannot be
  * disabled.
  */
-bool oom_killer_disable(void)
+bool oom_killer_disable(signed long timeout)
 {
+	signed long ret;
+
 	/*
 	 * Make sure to not race with an ongoing OOM killer. Check that the
 	 * current is not killed (possibly due to sharing the victim's memory).
@@ -693,19 +696,16 @@ bool oom_killer_disable(void)
 	oom_killer_disabled = true;
 	mutex_unlock(&oom_lock);
 
-	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
+	ret = wait_event_interruptible_timeout(oom_victims_wait,
+			!atomic_read(&oom_victims), timeout);
+	if (ret <= 0) {
+		oom_killer_enable();
+		return false;
+	}
 
 	return true;
 }
 
-/**
- * oom_killer_enable - enable OOM killer
- */
-void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
-
 static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
-- 
2.8.1


* [PATCH 07/10] mm, oom: enforce exit_oom_victim on current task
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
                   ` (5 preceding siblings ...)
  2016-07-28 19:42 ` [PATCH 06/10] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 19:42 ` [PATCH 08/10] exit, oom: postpone exit_oom_victim to later Michal Hocko
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

There are no users of exit_oom_victim on a !current task anymore so
enforce the API to always work on the current task.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h | 2 +-
 kernel/exit.c       | 2 +-
 mm/oom_kill.c       | 5 ++---
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 14a0f15c0c59..22e18c4adc98 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -102,7 +102,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e1356e6bb..c742c37c3a92 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 300957e59246..ca1cc24ba720 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -651,10 +651,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
-- 
2.8.1


* [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
                   ` (6 preceding siblings ...)
  2016-07-28 19:42 ` [PATCH 07/10] mm, oom: enforce exit_oom_victim on current task Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-30  8:20   ` Tetsuo Handa
  2016-07-28 19:42 ` [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
  2016-07-28 19:42 ` [PATCH 10/10] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
  9 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

exit_oom_victim was called after mmput because it is expected that
address space of the victim would get released by that time and there is
no reason to hold off the oom killer from selecting another task should
that be insufficient to handle the oom situation. In order to catch
post exit_mm() allocations we used to check for PF_EXITING but this
got removed by 6a618957ad17 ("mm: oom_kill: don't ignore oom score on
exiting tasks") because this check was lockup prone.

It seems that we have all needed pieces ready now and can finally
fix this properly (at least for CONFIG_MMU cases where we have the
oom_reaper).  Since "oom: keep mm of the killed task available" we have
a reliable way to ignore oom victims which are no longer interesting
because they either were reaped and do not sit on a lot of memory or
they are not reapable for some reason and it is safer to ignore them
and move on to another victim. That means that we can safely postpone
exit_oom_victim until closer to the final schedule.

There are possible advantages to this because we are reducing the
chances of further interference of the oom victim with the rest of the
system after oom_killer_disable(). Strictly speaking this is possible
right now because there are indeed allocations possible past exit_mm()
and who knows whether some of them can trigger IO. I haven't seen this
in practice though.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/exit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index c742c37c3a92..30e055117e5a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,8 +434,6 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
@@ -822,6 +820,9 @@ void do_exit(long code)
 	smp_mb();
 	raw_spin_unlock_wait(&tsk->pi_lock);
 
+	if (test_thread_flag(TIF_MEMDIE))
+		exit_oom_victim();
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
-- 
2.8.1


* [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
                   ` (7 preceding siblings ...)
  2016-07-28 19:42 ` [PATCH 08/10] exit, oom: postpone exit_oom_victim to later Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  2016-07-28 20:41   ` Michael S. Tsirkin
  2016-07-29 17:07   ` Oleg Nesterov
  2016-07-28 19:42 ` [PATCH 10/10] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
  9 siblings, 2 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko, Michael S. Tsirkin

From: Michal Hocko <mhocko@suse.com>

vhost driver relies on copy_from_user/get_user from a kernel thread.
This makes it impossible to reap the memory of an oom victim which
shares mm with the vhost kernel thread because it could see a zero
page unexpectedly and theoretically make an incorrect decision visible
outside of the killed task context. To quote Michael S. Tsirkin:
: Getting an error from __get_user and friends is handled gracefully.
: Getting zero instead of a real value will cause userspace
: memory corruption.

Make sure that each place which can read from userspace is annotated
properly and uses copy_from_user_mm, __get_user_mm or
copy_from_iter_mm respectively. Each gets the target mm as an argument
and performs a pessimistic check to rule out that the oom_reaper could
possibly have unmapped the particular page. __oom_reap_task then just
needs to mark the mm as unstable before it unmaps any page.

This is a preparatory patch without any functional changes because
the oom reaper doesn't touch mm shared with kthreads yet.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/vhost/scsi.c    |  2 +-
 drivers/vhost/vhost.c   | 18 +++++++++---------
 include/linux/sched.h   |  1 +
 include/linux/uaccess.h | 22 ++++++++++++++++++++++
 include/linux/uio.h     | 10 ++++++++++
 mm/oom_kill.c           |  8 ++++++++
 6 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 0e6fd556c982..2c8dc0b9a21f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -932,7 +932,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		 */
 		iov_iter_init(&out_iter, WRITE, vq->iov, out, out_size);
 
-		ret = copy_from_iter(req, req_size, &out_iter);
+		ret = copy_from_iter_mm(vq->dev->mm, req, req_size, &out_iter);
 		if (unlikely(ret != req_size)) {
 			vq_err(vq, "Faulted on copy_from_iter\n");
 			vhost_scsi_send_bad_target(vs, vq, head, out);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1e2bb6..71a754a0fe7e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1212,7 +1212,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = __get_user(last_used_idx, &vq->used->idx);
+	r = __get_user_mm(vq->dev->mm, last_used_idx, &vq->used->idx);
 	if (r)
 		goto err;
 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
@@ -1328,7 +1328,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
 			       i, count);
 			return -EINVAL;
 		}
-		if (unlikely(copy_from_iter(&desc, sizeof(desc), &from) !=
+		if (unlikely(copy_from_iter_mm(vq->dev->mm, &desc, sizeof(desc), &from) !=
 			     sizeof(desc))) {
 			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
 			       i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
@@ -1392,7 +1392,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
-	if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+	if (unlikely(__get_user_mm(vq->dev->mm, avail_idx, &vq->avail->idx))) {
 		vq_err(vq, "Failed to access avail idx at %p\n",
 		       &vq->avail->idx);
 		return -EFAULT;
@@ -1414,7 +1414,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(__get_user(ring_head,
+	if (unlikely(__get_user_mm(vq->dev->mm, ring_head,
 				&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
@@ -1450,7 +1450,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
+		ret = __copy_from_user_mm(vq->dev->mm, &desc, vq->desc + i, sizeof desc);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -1622,7 +1622,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
+		if (__get_user_mm(dev->mm, flags, &vq->avail->flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -1636,7 +1636,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (__get_user(event, vhost_used_event(vq))) {
+	if (__get_user_mm(dev->mm, event, vhost_used_event(vq))) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -1678,7 +1678,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
 	if (r)
 		return false;
 
@@ -1713,7 +1713,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 127c7f9a7719..1ba4642b1efb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
+#define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 349557825428..a327d5362581 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
 #endif		/* ARCH_HAS_NOCACHE_UACCESS */
 
 /*
+ * A safe variant of __get_user for use_mm() users to have a
+ * guarantee that the address space wasn't reaped in the background
+ */
+#define __get_user_mm(mm, x, ptr)				\
+({								\
+	int ___gu_err = __get_user(x, ptr);			\
+	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
+		___gu_err = -EFAULT;				\
+	___gu_err;						\
+})
+
+/* similar to __get_user_mm */
+static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
+		void *to, const void __user * from, unsigned long n)
+{
+	long ret = __copy_from_user(to, from, n);
+	if ((ret >= 0) && test_bit(MMF_UNSTABLE, &mm->flags))
+		return -EFAULT;
+	return ret;
+}
+
+/*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
  * @src: address to read from
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1b5d1cd796e2..4be6b24003d8 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -9,6 +9,7 @@
 #ifndef __LINUX_UIO_H
 #define __LINUX_UIO_H
 
+#include <linux/sched.h>
 #include <linux/kernel.h>
 #include <uapi/linux/uio.h>
 
@@ -84,6 +85,15 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+
+static inline size_t copy_from_iter_mm(struct mm_struct *mm, void *addr,
+		size_t bytes, struct iov_iter *i)
+{
+	size_t ret = copy_from_iter(addr, bytes, i);
+	if (!IS_ERR_VALUE(ret) && test_bit(MMF_UNSTABLE, &mm->flags))
+		return -EFAULT;
+	return ret;
+}
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ca1cc24ba720..6ccf63fbfc72 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -488,6 +488,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
+	/*
+	 * Tell all users of get_user_mm/copy_from_user_mm that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
 	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (is_vm_hugetlb_page(vma))
-- 
2.8.1


* [PATCH 10/10] oom, oom_reaper: allow to reap mm shared by the kthreads
  2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
                   ` (8 preceding siblings ...)
  2016-07-28 19:42 ` [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
@ 2016-07-28 19:42 ` Michal Hocko
  9 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-28 19:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The oom reaper was skipped for an mm which is shared with a kernel thread
(aka use_mm()). The primary concern was that such a kthread might want
to read from the userspace memory and see zero page as a result of the
oom reaper action. This seems to be overly conservative because none of
the current use_mm() users need to do copy_from_user or get_user. aio
code used to rely on copy_from_user but this is long gone along with
use_mm() usage in fs/aio.c.

We currently have only 3 users in the kernel:
- ffs_user_copy_worker, ep_user_copy_worker only do copy_to_iter()
- vhost_worker needs to copy from userspace but it relies on the
  safe __get_user_mm, copy_from_user_mm resp. copy_from_iter_mm

Add a note to use_mm about the copy_from_user risk and allow the oom
killer to invoke the oom_reaper for mms shared with kthreads. This will
make all the sane use cases reapable in practice.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmu_context.c |  6 ++++++
 mm/oom_kill.c    | 14 +++++++-------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index f802c2d216a7..61a7a90250be 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -16,6 +16,12 @@
  *	mm context.
  *	(Note: this routine is intended to be called only
  *	from a kernel thread context)
+ *
+ *	Do not use copy_from_user/__get_user from this context
+ *	and use the safe copy_from_user_mm/__get_user_mm because
+ *	the address space might get reclaimed behind our back by
+ *	the oom_reaper so an unexpected zero page might be
+ *	encountered.
  */
 void use_mm(struct mm_struct *mm)
 {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6ccf63fbfc72..ca83b1706e13 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -894,13 +894,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used. Hide the mm from the oom
-			 * killer to guarantee OOM forward progress.
-			 */
+		if (is_global_init(p)) {
 			can_oom_reap = false;
 			set_bit(MMF_OOM_SKIP, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
@@ -908,6 +902,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 					task_pid_nr(p), p->comm);
 			continue;
 		}
+		/*
+		 * No use_mm() user needs to read from the userspace so we are
+		 * ok to reap it.
+		 */
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-- 
2.8.1


* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-28 19:42 ` [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
@ 2016-07-28 20:41   ` Michael S. Tsirkin
  2016-07-29  6:04     ` Michal Hocko
  2016-07-29 17:07   ` Oleg Nesterov
  1 sibling, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-07-28 20:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov, Michal Hocko

On Thu, Jul 28, 2016 at 09:42:33PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> vhost driver relies on copy_from_user/get_user from a kernel thread.
> This makes it impossible to reap the memory of an oom victim which
> shares mm with the vhost kernel thread because it could see a zero
> page unexpectedly and theoretically make an incorrect decision visible
> outside of the killed task context. To quote Michael S. Tsirkin:
> : Getting an error from __get_user and friends is handled gracefully.
> : Getting zero instead of a real value will cause userspace
> : memory corruption.
> 
> Make sure that each place which can read from userspace is annotated
> properly and it uses copy_from_user_mm, __get_user_mm resp.
> copy_from_iter_mm. Each will get the target mm as an argument and it
> performs a pessimistic check to rule out that the oom_reaper could
> possibly unmap the particular page. __oom_reap_task then just needs to
> mark the mm as unstable before it unmaps any page.
> 
> This is a preparatory patch without any functional changes because
> the oom reaper doesn't touch mm shared with kthreads yet.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  drivers/vhost/scsi.c    |  2 +-
>  drivers/vhost/vhost.c   | 18 +++++++++---------
>  include/linux/sched.h   |  1 +
>  include/linux/uaccess.h | 22 ++++++++++++++++++++++
>  include/linux/uio.h     | 10 ++++++++++
>  mm/oom_kill.c           |  8 ++++++++
>  6 files changed, 51 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 0e6fd556c982..2c8dc0b9a21f 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -932,7 +932,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
>  		 */
>  		iov_iter_init(&out_iter, WRITE, vq->iov, out, out_size);
>  
> -		ret = copy_from_iter(req, req_size, &out_iter);
> +		ret = copy_from_iter_mm(vq->dev->mm, req, req_size, &out_iter);
>  		if (unlikely(ret != req_size)) {
>  			vq_err(vq, "Faulted on copy_from_iter\n");
>  			vhost_scsi_send_bad_target(vs, vq, head, out);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 669fef1e2bb6..71a754a0fe7e 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1212,7 +1212,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
>  		r = -EFAULT;
>  		goto err;
>  	}
> -	r = __get_user(last_used_idx, &vq->used->idx);
> +	r = __get_user_mm(vq->dev->mm, last_used_idx, &vq->used->idx);
>  	if (r)
>  		goto err;
>  	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
> @@ -1328,7 +1328,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  			       i, count);
>  			return -EINVAL;
>  		}
> -		if (unlikely(copy_from_iter(&desc, sizeof(desc), &from) !=
> +		if (unlikely(copy_from_iter_mm(vq->dev->mm, &desc, sizeof(desc), &from) !=
>  			     sizeof(desc))) {
>  			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
>  			       i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
> @@ -1392,7 +1392,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>  	/* Check it isn't doing very strange things with descriptor numbers. */
>  	last_avail_idx = vq->last_avail_idx;
> -	if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
> +	if (unlikely(__get_user_mm(vq->dev->mm, avail_idx, &vq->avail->idx))) {
>  		vq_err(vq, "Failed to access avail idx at %p\n",
>  		       &vq->avail->idx);
>  		return -EFAULT;
> @@ -1414,7 +1414,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>  	/* Grab the next descriptor number they're advertising, and increment
>  	 * the index we've seen. */
> -	if (unlikely(__get_user(ring_head,
> +	if (unlikely(__get_user_mm(vq->dev->mm, ring_head,
>  				&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
>  		vq_err(vq, "Failed to read head: idx %d address %p\n",
>  		       last_avail_idx,
> @@ -1450,7 +1450,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  			       i, vq->num, head);
>  			return -EINVAL;
>  		}
> -		ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
> +		ret = __copy_from_user_mm(vq->dev->mm, &desc, vq->desc + i, sizeof desc);
>  		if (unlikely(ret)) {
>  			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
>  			       i, vq->desc + i);
> @@ -1622,7 +1622,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  
>  	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
>  		__virtio16 flags;
> -		if (__get_user(flags, &vq->avail->flags)) {
> +		if (__get_user_mm(dev->mm, flags, &vq->avail->flags)) {
>  			vq_err(vq, "Failed to get flags");
>  			return true;
>  		}
> @@ -1636,7 +1636,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	if (unlikely(!v))
>  		return true;
>  
> -	if (__get_user(event, vhost_used_event(vq))) {
> +	if (__get_user_mm(dev->mm, event, vhost_used_event(vq))) {
>  		vq_err(vq, "Failed to get used event idx");
>  		return true;
>  	}
> @@ -1678,7 +1678,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	__virtio16 avail_idx;
>  	int r;
>  
> -	r = __get_user(avail_idx, &vq->avail->idx);
> +	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
>  	if (r)
>  		return false;
>  
> @@ -1713,7 +1713,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	/* They could have slipped one in as we were doing that: make
>  	 * sure it's written, then check again. */
>  	smp_mb();
> -	r = __get_user(avail_idx, &vq->avail->idx);
> +	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
>  	if (r) {
>  		vq_err(vq, "Failed to check avail idx at %p: %d\n",
>  		       &vq->avail->idx, r);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 127c7f9a7719..1ba4642b1efb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
> +#define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
>  
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 349557825428..a327d5362581 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
>  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
>  
>  /*
> + * A safe variant of __get_user for for use_mm() users to have a

for for -> for?

> + * gurantee that the address space wasn't reaped in the background
> + */
> +#define __get_user_mm(mm, x, ptr)				\
> +({								\
> +	int ___gu_err = __get_user(x, ptr);			\

I suspect you need smp_rmb() here to make sure the test does not
bypass the memory read.

You will accordingly need smp_wmb() when you set the flag,
maybe it's there already - I have not checked.

> +	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> +		___gu_err = -EFAULT;				\
> +	___gu_err;						\
> +})
> +
> +/* similar to __get_user_mm */
> +static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
> +		void *to, const void __user * from, unsigned long n)
> +{
> +	long ret = __copy_from_user(to, from, n);
> +	if ((ret >= 0) && test_bit(MMF_UNSTABLE, &mm->flags))
> +		return -EFAULT;
> +	return ret;
> +}
> +
> +/*
>   * probe_kernel_read(): safely attempt to read from a location
>   * @dst: pointer to the buffer that shall take the data
>   * @src: address to read from
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 1b5d1cd796e2..4be6b24003d8 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -9,6 +9,7 @@
>  #ifndef __LINUX_UIO_H
>  #define __LINUX_UIO_H
>  
> +#include <linux/sched.h>
>  #include <linux/kernel.h>
>  #include <uapi/linux/uio.h>
>  
> @@ -84,6 +85,15 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
>  			 struct iov_iter *i);
>  size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
>  size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
> +
> +static inline size_t copy_from_iter_mm(struct mm_struct *mm, void *addr,
> +		size_t bytes, struct iov_iter *i)
> +{
> +	size_t ret = copy_from_iter(addr, bytes, i);
> +	if (!IS_ERR_VALUE(ret) && test_bit(MMF_UNSTABLE, &mm->flags))
> +		return -EFAULT;
> +	return ret;
> +}
>  size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
>  size_t iov_iter_zero(size_t bytes, struct iov_iter *);
>  unsigned long iov_iter_alignment(const struct iov_iter *i);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ca1cc24ba720..6ccf63fbfc72 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -488,6 +488,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  		goto unlock_oom;
>  	}
>  
> +	/*
> +	 * Tell all users of get_user_mm/copy_from_user_mm that the content
> +	 * is no longer stable. No barriers really needed because unmapping
> +	 * should imply barriers already

ok

> and the reader would hit a page fault
> +	 * if it stumbled over a reaped memory.

This last point I don't get. The flag read could bypass the data read;
if that happens, the data read could happen after the unmap.
Yes, it might get a PF, but you handle that, correct?

> +	 */
> +	set_bit(MMF_UNSTABLE, &mm->flags);
> +

I would really prefer a callback that vhost would register
to stop all accesses. Tell me if you need help on the above idea.
But with the above nits addressed,
I think this would be acceptable as well.

>  	tlb_gather_mmu(&tlb, mm, 0, -1);
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
>  		if (is_vm_hugetlb_page(vma))
> -- 
> 2.8.1



* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-28 20:41   ` Michael S. Tsirkin
@ 2016-07-29  6:04     ` Michal Hocko
  2016-07-29 13:14       ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-07-29  6:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov

On Thu 28-07-16 23:41:53, Michael S. Tsirkin wrote:
> On Thu, Jul 28, 2016 at 09:42:33PM +0200, Michal Hocko wrote:
[...]
> > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > index 349557825428..a327d5362581 100644
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> >  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
> >  
> >  /*
> > + * A safe variant of __get_user for for use_mm() users to have a
> 
> for for -> for?

fixed

> 
> > + * gurantee that the address space wasn't reaped in the background
> > + */
> > +#define __get_user_mm(mm, x, ptr)				\
> > +({								\
> > +	int ___gu_err = __get_user(x, ptr);			\
> 
> I suspect you need smp_rmb() here to make sure the test does not
> bypass the memory read.
> 
> You will accordingly need smp_wmb() when you set the flag,
> maybe it's there already - I have not checked.

As the comment for setting the flag explains the memory barriers
shouldn't be really needed AFAIU. More on that below.

[...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index ca1cc24ba720..6ccf63fbfc72 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -488,6 +488,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> >  		goto unlock_oom;
> >  	}
> >  
> > +	/*
> > +	 * Tell all users of get_user_mm/copy_from_user_mm that the content
> > +	 * is no longer stable. No barriers really needed because unmapping
> > +	 * should imply barriers already
> 
> ok
> 
> > and the reader would hit a page fault
> > +	 * if it stumbled over a reaped memory.
> 
> This last point I don't get. The flag read could bypass the data read;
> if that happens, the data read could happen after the unmap.
> Yes, it might get a PF, but you handle that, correct?

The point I've tried to make is that if the reader really page faults
then get_user will imply the full barrier already. If get_user didn't
page fault then the state of the flag is not really important because
the reaper shouldn't have touched it. Does it make more sense now, or
have I missed your question?

> 
> > +	 */
> > +	set_bit(MMF_UNSTABLE, &mm->flags);
> > +
> 
> I would really prefer a callback that vhost would register
> to stop all accesses. Tell me if you need help on the above idea.


Well, in order to make the callback workable the oom reaper would have
to synchronize with that callback until it declares all currently
ongoing accesses done. That means the oom reaper would have to
block/wait, and that is something I would really like to prevent because
it just adds another possibility of a lockup (say the get_user cannot
make forward progress because it is stuck in a page fault allocating
memory). Or do you see any other way to implement such a callback
mechanism without blocking on the oom_reaper side?

> But with the above nits addressed,
> I think this would be acceptable as well.

Thank you for your review and feedback!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-29  6:04     ` Michal Hocko
@ 2016-07-29 13:14       ` Michael S. Tsirkin
  2016-07-29 13:35         ` Michal Hocko
  2016-08-12  9:43         ` Michal Hocko
  0 siblings, 2 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-07-29 13:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov

On Fri, Jul 29, 2016 at 08:04:22AM +0200, Michal Hocko wrote:
> On Thu 28-07-16 23:41:53, Michael S. Tsirkin wrote:
> > On Thu, Jul 28, 2016 at 09:42:33PM +0200, Michal Hocko wrote:
> [...]
> > > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > > index 349557825428..a327d5362581 100644
> > > --- a/include/linux/uaccess.h
> > > +++ b/include/linux/uaccess.h
> > > @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> > >  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
> > >  
> > >  /*
> > > + * A safe variant of __get_user for for use_mm() users to have a
> > 
> > for for -> for?
> 
> fixed
> 
> > 
> > > + * gurantee that the address space wasn't reaped in the background
> > > + */
> > > +#define __get_user_mm(mm, x, ptr)				\
> > > +({								\
> > > +	int ___gu_err = __get_user(x, ptr);			\
> > 
> > I suspect you need smp_rmb() here to make sure the test does not
> > bypass the memory read.
> > 
> > You will accordingly need smp_wmb() when you set the flag,
> > maybe it's there already - I have not checked.
> 
> As the comment for setting the flag explains the memory barriers
> shouldn't be really needed AFAIU. More on that below.
> 
> [...]
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index ca1cc24ba720..6ccf63fbfc72 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -488,6 +488,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > >  		goto unlock_oom;
> > >  	}
> > >  
> > > +	/*
> > > +	 * Tell all users of get_user_mm/copy_from_user_mm that the content
> > > +	 * is no longer stable. No barriers really needed because unmapping
> > > +	 * should imply barriers already
> > 
> > ok
> > 
> > > and the reader would hit a page fault
> > > +	 * if it stumbled over a reaped memory.
> > 
> > This last point I don't get. The flag read could bypass the data read;
> > if that happens, the data read could happen after the unmap.
> > Yes, it might get a PF, but you handle that, correct?
> 
> The point I've tried to make is that if the reader really page faults
> then get_user will imply the full barrier already. If get_user didn't
> page fault then the state of the flag is not really important because
> the reaper shouldn't have touched it. Does it make more sense now, or
> have I missed your question?

Can the task flag read happen before the get_user page fault?
If it does, the task flag might not be set even though the
page fault triggered.

> > 
> > > +	 */
> > > +	set_bit(MMF_UNSTABLE, &mm->flags);
> > > +
> > 
> > I would really prefer a callback that vhost would register
> > to stop all accesses. Tell me if you need help on the above idea.
> 
> 
> Well, in order to make the callback workable the oom reaper would have
> to synchronize with that callback until it declares all currently
> ongoing accesses done. That means the oom reaper would have to
> block/wait, and that is something I would really like to prevent because
> it just adds another possibility of a lockup (say the get_user cannot
> make forward progress because it is stuck in a page fault allocating
> memory). Or do you see any other way to implement such a callback
> mechanism without blocking on the oom_reaper side?

I'll think it over and respond.

> 
> > But with the above nits addressed,
> > I think this would be acceptable as well.
> 
> Thank you for your review and feedback!
> -- 
> Michal Hocko
> SUSE Labs



* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-29 13:14       ` Michael S. Tsirkin
@ 2016-07-29 13:35         ` Michal Hocko
  2016-07-29 17:57           ` Michael S. Tsirkin
  2016-08-12  9:43         ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-07-29 13:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov

On Fri 29-07-16 16:14:10, Michael S. Tsirkin wrote:
> On Fri, Jul 29, 2016 at 08:04:22AM +0200, Michal Hocko wrote:
> > On Thu 28-07-16 23:41:53, Michael S. Tsirkin wrote:
> > > On Thu, Jul 28, 2016 at 09:42:33PM +0200, Michal Hocko wrote:
[...]
> > > > and the reader would hit a page fault
> > > > +	 * if it stumbled over a reaped memory.
> > > 
> > > This last point I don't get. The flag read could bypass the data read;
> > > if that happens, the data read could happen after the unmap.
> > > Yes, it might get a PF, but you handle that, correct?
> > 
> > The point I've tried to make is that if the reader really page faults
> > then get_user will imply the full barrier already. If get_user didn't
> > page fault then the state of the flag is not really important because
> > the reaper shouldn't have touched it. Does it make more sense now, or
> > have I missed your question?
> 
> Can the task flag read happen before the get_user page fault?

Do you mean?

get_user_mm()
  temp = false <- test_bit(MMF_UNSTABLE, &mm->flags)
  ret = __get_user(x, ptr)
  #PF
  if (!ret && temp) # misses the flag

The code is basically doing

  if (!__get_user() && test_bit(MMF_UNSTABLE, &mm->flags))

so the test_bit part of the conditional cannot be evaluated before the
__get_user() part is done. The compiler cannot reorder two dependent
subconditions AFAIK.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-28 19:42 ` [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
  2016-07-28 20:41   ` Michael S. Tsirkin
@ 2016-07-29 17:07   ` Oleg Nesterov
  2016-07-31  9:11     ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2016-07-29 17:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, David Rientjes,
	Vladimir Davydov, Michal Hocko, Michael S. Tsirkin

Well. I promised to not argue, but I can't resist...

On 07/28, Michal Hocko wrote:
>
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
>  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
>
>  /*
> + * A safe variant of __get_user for for use_mm() users to have a
> + * gurantee that the address space wasn't reaped in the background
> + */
> +#define __get_user_mm(mm, x, ptr)				\
> +({								\
> +	int ___gu_err = __get_user(x, ptr);			\
> +	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> +		___gu_err = -EFAULT;				\
> +	___gu_err;						\
> +})
> +
> +/* similar to __get_user_mm */
> +static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
> +		void *to, const void __user * from, unsigned long n)
> +{
> +	long ret = __copy_from_user(to, from, n);
> +	if ((ret >= 0) && test_bit(MMF_UNSTABLE, &mm->flags))
> +		return -EFAULT;
> +	return ret;
> +}

I still fail to understand why we actually need this, but never mind.

Can't we instead change handle_pte_fault() or do_anonymous_page() to
fail if MMF_UNSTABLE? We can rely on pte_offset_map_lock(); MMF_UNSTABLE
must be visible under this lock.

We do not even need to actually disallow re-populating the unmapped
pte afaics, so we can even change handle_mm_fault() to check
MMF_UNSTABLE at the end and return VM_FAULT_SIGBUS if it is set.

Oleg.



* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-29 13:35         ` Michal Hocko
@ 2016-07-29 17:57           ` Michael S. Tsirkin
  2016-07-31  9:44             ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-07-29 17:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov

On Fri, Jul 29, 2016 at 03:35:29PM +0200, Michal Hocko wrote:
> On Fri 29-07-16 16:14:10, Michael S. Tsirkin wrote:
> > On Fri, Jul 29, 2016 at 08:04:22AM +0200, Michal Hocko wrote:
> > > On Thu 28-07-16 23:41:53, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 28, 2016 at 09:42:33PM +0200, Michal Hocko wrote:
> [...]
> > > > > and the reader would hit a page fault
> > > > > +	 * if it stumbled over a reaped memory.
> > > > 
> > > > This last point I don't get. The flag read could bypass the data read;
> > > > if that happens, the data read could happen after the unmap.
> > > > Yes, it might get a PF, but you handle that, correct?
> > > 
> > > The point I've tried to make is that if the reader really page faults
> > > then get_user will imply the full barrier already. If get_user didn't
> > > page fault then the state of the flag is not really important because
> > > the reaper shouldn't have touched it. Does it make more sense now, or
> > > have I missed your question?
> > 
> > Can the task flag read happen before the get_user page fault?
> 
> Do you mean?
> 
> get_user_mm()
>   temp = false <- test_bit(MMF_UNSTABLE, &mm->flags)
>   ret = __get_user(x, ptr)
>   #PF
>   if (!ret && temp) # misses the flag
> 
> The code is basically doing
> 
>   if (!__get_user() && test_bit(MMF_UNSTABLE, &mm->flags))
> 
> so the test_bit part of the conditional cannot be evaluated before the
> __get_user() part is done. The compiler cannot reorder two dependent
> subconditions AFAIK.

But maybe the CPU can.

> -- 
> Michal Hocko
> SUSE Labs



* Re: [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-07-28 19:42 ` [PATCH 08/10] exit, oom: postpone exit_oom_victim to later Michal Hocko
@ 2016-07-30  8:20   ` Tetsuo Handa
  2016-07-31  9:35     ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2016-07-30  8:20 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> exit_oom_victim was called after mmput because it is expected that
> address space of the victim would get released by that time and there is
> no reason to hold off the oom killer from selecting another task should
> that be insufficient to handle the oom situation. In order to catch
> post exit_mm() allocations we used to check for PF_EXITING but this
> got removed by 6a618957ad17 ("mm: oom_kill: don't ignore oom score on
> exiting tasks") because this check was lockup prone.
> 
> It seems that we have all needed pieces ready now and can finally
> fix this properly (at least for CONFIG_MMU cases where we have the
> oom_reaper).  Since "oom: keep mm of the killed task available" we have
> a reliable way to ignore oom victims which are no longer interesting
> because they either were reaped and do not sit on a lot of memory or
> they are not reapable for some reason and it is safer to ignore them
> and move on to another victim. That means that we can safely postpone
> exit_oom_victim to closer to the final schedule.

I don't like this patch. The advantage of this patch will be that we can
avoid selecting the next OOM victim when only OOM victims need to allocate
memory after they left exit_mm(). But the disadvantage of this patch will
be that we increase the possibility of depleting 100% of memory reserves
by allowing them to allocate using ALLOC_NO_WATERMARKS after they left
exit_mm(). It is possible that a user creates a process with 10000 threads
and lets that process be OOM-killed. Then, this patch allows 10000 threads
to start consuming memory reserves after they left exit_mm(). OOM victims
are not the only threads that need to allocate memory for termination.
Non-OOM victims might need to allocate memory at exit_task_work() in order
to allow OOM victims to make forward progress. I think that allocations
from do_exit() are important for terminating cleanly (from the point of
view of filesystem integrity and kernel object management) and such
allocations should not be given up simply because ALLOC_NO_WATERMARKS
allocations failed.

> 
> There is possible advantages of this because we are reducing chances
> of further interference of the oom victim with the rest of the system
> after oom_killer_disable(). Strictly speaking this is possible right
> now because there are indeed allocations possible past exit_mm() and
> who knows whether some of them can trigger IO. I haven't seen this in
> practice though.

I don't know for which I/O oom_killer_disable() must act as a hard
barrier. But a safer way is to get rid of TIF_MEMDIE's triple meanings.
The first one, which prevents the OOM killer from selecting the next OOM
victim, was removed by replacing the TIF_MEMDIE test in
oom_scan_process_thread() with tsk_is_oom_victim(). The second one,
which allows the OOM victims to deplete 100% of memory reserves, wants
some changes in order not to block memory allocations by non-OOM victims
(e.g. GFP_ATOMIC allocations by interrupt handlers, GFP_NOIO / GFP_NOFS
allocations by subsystems which are needed for making forward progress
of threads in do_exit()) by consuming too much of the memory reserves.
The third one, which blocks oom_killer_disable(), can be removed by
replacing the TIF_MEMDIE test in exit_oom_victim() with a
PFA_OOM_WAITING test like the patch below. (If oom_killer_disable()
were specific to CONFIG_MMU=y kernels, I think that not thawing OOM
victims would be simpler because the OOM reaper can reclaim memory
without thawing OOM victims.)

---
 include/linux/oom.h   | 2 +-
 include/linux/sched.h | 4 ++++
 kernel/exit.c         | 4 +++-
 kernel/freezer.c      | 2 +-
 mm/oom_kill.c         | 7 +++----
 5 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 22e18c4..69d56c5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -102,7 +102,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(void);
+extern void unmark_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 32212e9..7f624d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2290,6 +2290,7 @@ static inline void memalloc_noio_restore(unsigned int flags)
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
 #define PFA_SPREAD_SLAB  2      /* Spread some slab caches over cpuset */
 #define PFA_LMK_WAITING  3      /* Lowmemorykiller is waiting */
+#define PFA_OOM_WAITING  4      /* Freezer is waiting for OOM killer */
 
 
 #define TASK_PFA_TEST(name, func)					\
@@ -2316,6 +2317,9 @@ TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
 TASK_PFA_TEST(LMK_WAITING, lmk_waiting)
 TASK_PFA_SET(LMK_WAITING, lmk_waiting)
 
+TASK_PFA_TEST(OOM_WAITING, oom_waiting)
+TASK_PFA_SET(OOM_WAITING, oom_waiting)
+
 /*
  * task->jobctl flags
  */
diff --git a/kernel/exit.c b/kernel/exit.c
index e9bca29..b19dbfd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim();
+		clear_thread_flag(TIF_MEMDIE);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
@@ -902,6 +902,8 @@ void do_exit(long code)
 	smp_mb();
 	raw_spin_unlock_wait(&tsk->pi_lock);
 
+	if (task_oom_waiting(tsk))
+		unmark_oom_victim();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 6f56a9e..306270d 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,7 +42,7 @@ bool freezing_slow_path(struct task_struct *p)
 	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
 		return false;
 
-	if (test_tsk_thread_flag(p, TIF_MEMDIE))
+	if (task_oom_waiting(p))
 		return false;
 
 	if (pm_nosig_freezing || cgroup_freezing(p))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ca1cc24..c7ae974 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -644,17 +644,16 @@ void mark_oom_victim(struct task_struct *tsk)
 	 * any memory and livelock. freezing_slow_path will tell the freezer
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
+	task_set_oom_waiting(tsk);
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
 }
 
 /**
- * exit_oom_victim - note the exit of an OOM victim
+ * unmark_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(void)
+void unmark_oom_victim(void)
 {
-	clear_thread_flag(TIF_MEMDIE);
-
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
 }
-- 
1.8.3.1



* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-29 17:07   ` Oleg Nesterov
@ 2016-07-31  9:11     ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-31  9:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, David Rientjes,
	Vladimir Davydov, Michael S. Tsirkin

On Fri 29-07-16 19:07:28, Oleg Nesterov wrote:
> Well. I promised to not argue, but I can't resist...
> 
> On 07/28, Michal Hocko wrote:
> >
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> >  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
> >
> >  /*
> > + * A safe variant of __get_user for for use_mm() users to have a
> > + * gurantee that the address space wasn't reaped in the background
> > + */
> > +#define __get_user_mm(mm, x, ptr)				\
> > +({								\
> > +	int ___gu_err = __get_user(x, ptr);			\
> > +	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> > +		___gu_err = -EFAULT;				\
> > +	___gu_err;						\
> > +})
> > +
> > +/* similar to __get_user_mm */
> > +static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
> > +		void *to, const void __user * from, unsigned long n)
> > +{
> > +	long ret = __copy_from_user(to, from, n);
> > +	if ((ret >= 0) && test_bit(MMF_UNSTABLE, &mm->flags))
> > +		return -EFAULT;
> > +	return ret;
> > +}
> 
> Still fail to understand why do we actually need this, but nevermind.

Well, I only rely on what Michael told me about the possible breakage
because I am not familiar enough with the internals of the vhost
driver to tell any better.

> Can't we instead change handle_pte_fault() or do_anonymous_page() to
> fail if MMF_UNSTABLE? We can rely on pte_offset_map_lock();
> MMF_UNSTABLE must be visible under this lock.

I have considered this option but felt like this would impose the
overhead (small but still non-zero) to everybody while actually only one
user really needs this. If we had more users the page fault path might
be worthwhile, but there are only three use_mm() users and only one
of them really needs it.

> We do not even need to actually disallow re-populating the unmapped
> pte afaics, so we can even change handle_mm_fault() to check
> MMF_UNSTABLE at the end and return VM_FAULT_SIGBUS if it is set.
> 
> Oleg.
> 

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-07-30  8:20   ` Tetsuo Handa
@ 2016-07-31  9:35     ` Michal Hocko
  2016-07-31 10:19       ` Michal Hocko
  2016-08-01 10:46       ` Tetsuo Handa
  0 siblings, 2 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-31  9:35 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

On Sat 30-07-16 17:20:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > exit_oom_victim was called after mmput because it is expected that
> > address space of the victim would get released by that time and there is
> > no reason to hold off the oom killer from selecting another task should
> > that be insufficient to handle the oom situation. In order to catch
> > post exit_mm() allocations we used to check for PF_EXITING but this
> > got removed by 6a618957ad17 ("mm: oom_kill: don't ignore oom score on
> > exiting tasks") because this check was lockup prone.
> > 
> > It seems that we have all needed pieces ready now and can finally
> > fix this properly (at least for CONFIG_MMU cases where we have the
> > oom_reaper).  Since "oom: keep mm of the killed task available" we have
> > a reliable way to ignore oom victims which are no longer interesting
> > because they either were reaped and do not sit on a lot of memory or
> > they are not reapable for some reason and it is safer to ignore them
> > and move on to another victim. That means that we can safely postpone
> > exit_oom_victim to closer to the final schedule.
> 
> I don't like this patch. The advantage of this patch will be that we can
> avoid selecting next OOM victim when only OOM victims need to allocate
> memory after they left exit_mm().

Not really, as we do not rely on TIF_MEMDIE or signal->oom_victims to
block new oom victim selection anymore.

> But the disadvantage of this patch will
> be that we increase the possibility of depleting 100% of memory reserves
> by allowing them to allocate using ALLOC_NO_WATERMARKS after they left
> exit_mm().

I think this is a separate problem. As the current code stands we can
already deplete memory reserves. A large number of threads might be
sitting in an allocation loop before they bail out to handle the
SIGKILL. Exit path shouldn't add too much on top of that. If we want to
be reliable in not consuming all the reserves we would have to employ
some form of throttling and that is out of scope of this patch.

> It is possible that a user creates a process with 10000 threads
> and lets that process be OOM-killed. Then, this patch allows 10000 threads
> to start consuming memory reserves after they left exit_mm(). OOM victims
> are not the only threads who need to allocate memory for termination. Non
> OOM victims might need to allocate memory at exit_task_work() in order to
> allow OOM victims to make forward progress.

This might be possible, but unlike regular exiting tasks we do
reclaim oom victim's memory in the background. So while they can consume
memory reserves we should also give some (and arguably much more) memory
back. The reserves are there to expedite the exit.

> I think that allocations from
> do_exit() are important for terminating cleanly (from the point of view of
> filesystem integrity and kernel object management) and such allocations
> should not be given up simply because ALLOC_NO_WATERMARKS allocations
> failed.

We are talking about a fatal condition when OOM killer forcefully kills
a task. Chances are that the userspace leaves so much state behind that
a manual cleanup would be necessary anyway. Depleting the memory
reserves is not nice but I really believe that this particular patch
doesn't make the situation really much worse than before.
 
> > There are possible advantages to this because we are reducing chances
> > of further interference of the oom victim with the rest of the system
> > after oom_killer_disable(). Strictly speaking this is possible right
> > now because there are indeed allocations possible past exit_mm() and
> > who knows whether some of them can trigger IO. I haven't seen this in
> > practice though.
> 
> I don't know for which I/O oom_killer_disable() must act as a hard barrier.

Any allocation that could trigger the IO can corrupt the hibernation
image or access the half suspended device. The whole point of
oom_killer_disable is to prevent anything like that from happening.

> But safer way is to get rid of TIF_MEMDIE's triple meanings. The first
> one which prevents the OOM killer from selecting next OOM victim was
> removed by replacing TIF_MEMDIE test in oom_scan_process_thread() with
> tsk_is_oom_victim(). The second one which allows the OOM victims to
> deplete 100% of memory reserves wants some changes in order not to
> block memory allocations by non OOM victims (e.g. GFP_ATOMIC allocations
> by interrupt handlers, GFP_NOIO / GFP_NOFS allocations by subsystems
> which are needed for making forward progress of threads in do_exit())
> by consuming too much of memory reserves. The third one which blocks
> oom_killer_disable() can be removed by replacing TIF_MEMDIE test in
> exit_oom_victim() with PFA_OOM_WAITING test like below patch.

I plan to remove TIF_MEMDIE dependency for this as well but I would like
to finish this pile first. We actually do not need any flag for that. We
just need to detect last exiting thread and tsk_is_oom_victim. I have
some preliminary code for that.

> (If
> oom_killer_disable() were specific to CONFIG_MMU=y kernels, I think
> that not thawing OOM victims will be simpler because the OOM reaper
> can reclaim memory without thawing OOM victims.)

Well I do not think keeping an oom victim inside the fridge is a good
idea. The task might be not sitting on any reclaimable memory but it
still might consume resources which are bound to its life time (open
files and their buffers etc.).
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-29 17:57           ` Michael S. Tsirkin
@ 2016-07-31  9:44             ` Michal Hocko
  2016-08-12  9:42               ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-07-31  9:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov, Paul E. McKenney

On Fri 29-07-16 20:57:44, Michael S. Tsirkin wrote:
> On Fri, Jul 29, 2016 at 03:35:29PM +0200, Michal Hocko wrote:
> > On Fri 29-07-16 16:14:10, Michael S. Tsirkin wrote:
> > > On Fri, Jul 29, 2016 at 08:04:22AM +0200, Michal Hocko wrote:
> > > > On Thu 28-07-16 23:41:53, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 28, 2016 at 09:42:33PM +0200, Michal Hocko wrote:
> > [...]
> > > > > > and the reader would hit a page fault
> > > > > > +	 * if it stumbled over a reaped memory.
> > > > > 
> > > > > This last point I don't get. flag read could bypass data read
> > > > > if that happens data read could happen after unmap
> > > > > yes it might get a PF but you handle that, correct?
> > > > 
> > > > The point I've tried to make is that if the reader really page faults
> > > > then get_user will imply the full barrier already. If get_user didn't
> > > > page fault then the state of the flag is not really important because
> > > > the reaper shouldn't have touched it. Does it make more sense now or
> > > > have I missed your question?
> > > 
> > > Can task flag read happen before the get_user pagefault?
> > 
> > Do you mean?
> > 
> > get_user_mm()
> >   temp = false <- test_bit(MMF_UNSTABLE, &mm->flags)
> >   ret = __get_user(x, ptr)
> >   #PF
> >   if (!ret && temp) # misses the flag
> > 
> > The code is basically doing
> > 
> >   if (!__get_user() && test_bit(MMF_UNSTABLE, &mm->flags))
> > 
> > so the test_bit part of the conditional cannot be evaluated before
> > the __get_user() part is done. The compiler cannot reorder two
> > dependent subconditions AFAIK.
> 
> But maybe the CPU can.

Are you sure? How does that differ from
	if (ptr && ptr->something)
construct?

Let's CC Paul. Just to describe it, we have the following
situation:

#define __get_user_mm(mm, x, ptr)				\
({								\
	int ___gu_err = __get_user(x, ptr);			\
	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
		___gu_err = -EFAULT;				\
	___gu_err;						\
})

and the oom reaper doing:

	set_bit(MMF_UNSTABLE, &mm->flags);

	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
		unmap_page_range

I assume that write memory barrier between set_bit and unmap_page_range
is not really needed because unmapping should already imply the memory
barrier. A read memory barrier between __get_user and test_bit shouldn't
be really needed because we can tolerate a stale value if __get_user
didn't #PF because we haven't unmapped that address obviously. If we
unmapped it then __get_user would #PF and that should imply a full
memory barrier as well. Now the question is whether a CPU can speculate
and read the flag before we issue the #PF.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-07-31  9:35     ` Michal Hocko
@ 2016-07-31 10:19       ` Michal Hocko
  2016-08-01 10:46       ` Tetsuo Handa
  1 sibling, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-07-31 10:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

On Sun 31-07-16 11:35:30, Michal Hocko wrote:
> On Sat 30-07-16 17:20:30, Tetsuo Handa wrote:
[...]
> > But safer way is to get rid of TIF_MEMDIE's triple meanings. The first
> > one which prevents the OOM killer from selecting next OOM victim was
> > removed by replacing TIF_MEMDIE test in oom_scan_process_thread() with
> > tsk_is_oom_victim(). The second one which allows the OOM victims to
> > deplete 100% of memory reserves wants some changes in order not to
> > block memory allocations by non OOM victims (e.g. GFP_ATOMIC allocations
> > by interrupt handlers, GFP_NOIO / GFP_NOFS allocations by subsystems
> > which are needed for making forward progress of threads in do_exit())
> > by consuming too much of memory reserves. The third one which blocks
> > oom_killer_disable() can be removed by replacing TIF_MEMDIE test in
> > exit_oom_victim() with PFA_OOM_WAITING test like below patch.
> 
> I plan to remove TIF_MEMDIE dependency for this as well but I would like
> to finish this pile first. We actually do not need any flag for that. We
> just need to detect last exiting thread and tsk_is_oom_victim. I have
> some preliminary code for that.

That being said. If you _really_ consider this patch to be controversial
I can drop it and handle it with other patches which should handle also
TIF_MEMDIE removal. The rest of the series doesn't really depend on it
in any way. I just though this would be easy enough to carry it with
this pile already. I do not insist on it.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-07-31  9:35     ` Michal Hocko
  2016-07-31 10:19       ` Michal Hocko
@ 2016-08-01 10:46       ` Tetsuo Handa
  2016-08-01 11:33         ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2016-08-01 10:46 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

Michal Hocko wrote:
> On Sat 30-07-16 17:20:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > exit_oom_victim was called after mmput because it is expected that
> > > address space of the victim would get released by that time and there is
> > > no reason to hold off the oom killer from selecting another task should
> > > that be insufficient to handle the oom situation. In order to catch
> > > post exit_mm() allocations we used to check for PF_EXITING but this
> > > got removed by 6a618957ad17 ("mm: oom_kill: don't ignore oom score on
> > > exiting tasks") because this check was lockup prone.
> > > 
> > > It seems that we have all needed pieces ready now and can finally
> > > fix this properly (at least for CONFIG_MMU cases where we have the
> > > oom_reaper).  Since "oom: keep mm of the killed task available" we have
> > > a reliable way to ignore oom victims which are no longer interesting
> > > because they either were reaped and do not sit on a lot of memory or
> > > they are not reapable for some reason and it is safer to ignore them
> > > and move on to another victim. That means that we can safely postpone
> > > exit_oom_victim to closer to the final schedule.
> > 
> > I don't like this patch. The advantage of this patch will be that we can
> > avoid selecting next OOM victim when only OOM victims need to allocate
> > memory after they left exit_mm().
> 
> Not really as we do not rely on TIF_MEMDIE nor signal->oom_victims to
> block new oom victim selection anymore.

I meant "whether out_of_memory() is called or not (when OOM victims need
to allocate memory after they left exit_mm())".

I did not mean "whether out_of_memory() selects next OOM victim or not"
because "whether MMF_OOM_SKIP on signal->oom_mm is set or not" depends
on "whether out_of_memory() is called before MMF_OOM_SKIP is set on
signal->oom_mm" which depends on timing.

> 
> > But the disadvantage of this patch will
> > be that we increase the possibility of depleting 100% of memory reserves
> > by allowing them to allocate using ALLOC_NO_WATERMARKS after they left
> > exit_mm().
> 
> I think this is a separate problem. As the current code stands we can
> already deplete memory reserves. The large number of threads might be
> sitting in an allocation loop before they bail out to handle the
> SIGKILL. Exit path shouldn't add too much on top of that. If we want to
> be reliable in not consuming all the reserves we would have to employ
> some form of throttling and that is out of scope of this patch.

I'm suggesting such throttling by allowing fatal_signal_pending() or
PF_EXITING threads access to only some portion of memory reserves.

> 
> > It is possible that a user creates a process with 10000 threads
> > and let that process be OOM-killed. Then, this patch allows 10000 threads
> > to start consuming memory reserves after they left exit_mm(). OOM victims
> > are not the only threads who need to allocate memory for termination. Non
> > OOM victims might need to allocate memory at exit_task_work() in order to
> > allow OOM victims to make forward progress.
> 
> this might be possible but unlike the regular exiting tasks we do
> reclaim oom victim's memory in the background. So while they can consume
> memory reserves we should also give some (and arguably much more) memory
> back. The reserves are there to expedite the exit.

Background reclaim does not occur on CONFIG_MMU=n kernels. But this patch
also affects CONFIG_MMU=n kernels. If a process with two threads was
OOM-killed and one thread consumed too much memory after it left exit_mm()
before the other thread sets MMF_OOM_SKIP on their mm by returning from
exit_aio() etc. in __mmput() from mmput() from exit_mm(), this patch
introduces a new possibility to OOM livelock. I think it is wild to assume
that "CONFIG_MMU=n kernels can OOM livelock even without this patch. Thus,
let's apply this patch even though this patch might break the balance of
OOM handling in CONFIG_MMU=n kernels."

Also, where is the guarantee that memory reclaimed by the OOM reaper is
used for terminating exiting threads? Since we do not prefer
fatal_signal_pending() or PF_EXITING threads over threads which are
neither fatal_signal_pending() nor PF_EXITING, it is possible that all
memory reclaimed by the OOM reaper is depleted by the latter.
Yes, the OOM reaper will allow the OOM killer to select next OOM victim.
But the intention of this patch is to avoid calling out_of_memory() by
allowing OOM victims access to memory reserves, isn't it?
After all, we need to call out_of_memory() regardless of whether we prefer
TIF_MEMDIE (or signal->oom_mm != NULL) threads over !TIF_MEMDIE (or
signal->oom_mm == NULL) threads.

So, it is not clear to me that this patch is an improvement.

> 
> > I think that allocations from
> > do_exit() are important for terminating cleanly (from the point of view of
> > filesystem integrity and kernel object management) and such allocations
> > should not be given up simply because ALLOC_NO_WATERMARKS allocations
> > failed.
> 
> We are talking about a fatal condition when OOM killer forcefully kills
> a task. Chances are that the userspace leaves so much state behind that
> a manual cleanup would be necessary anyway. Depleting the memory
> reserves is not nice but I really believe that this particular patch
> doesn't make the situation really much worse than before.

I'm not talking about inconsistency in userspace programs. I'm talking
about inconsistency of objects managed by kernel (e.g. failing to drop
references) caused by allocation failures.

>  
> > > There is possible advantages of this because we are reducing chances
> > > of further interference of the oom victim with the rest of the system
> > > after oom_killer_disable(). Strictly speaking this is possible right
> > > now because there are indeed allocations possible past exit_mm() and
> > > who knows whether some of them can trigger IO. I haven't seen this in
> > > practice though.
> > 
> > I don't know which I/O oom_killer_disable() must act as a hard barrier.
> 
> Any allocation that could trigger the IO can corrupt the hibernation
> image or access the half suspended device. The whole point of
> oom_killer_disable is to prevent anything like that to happen.

That's a puzzling answer. What I/O? If fs writeback done by a GFP_FS
allocation issued by userspace processes between after returning from
exit_mm() and reaching final schedule() in do_exit() is problematic, why
fs writeback issued by a GFP_FS allocation done by kernel threads after
returning from oom_killer_disable() is not problematic? I think any I/O
which userspace processes can do is also doable by kernel threads.
Where is the guarantee that kernel threads doing the I/O for which
oom_killer_disable() acts as a hard barrier do not corrupt the hibernation
image or access the half suspended device?

> 
> > But safer way is to get rid of TIF_MEMDIE's triple meanings. The first
> > one which prevents the OOM killer from selecting next OOM victim was
> > removed by replacing TIF_MEMDIE test in oom_scan_process_thread() with
> > tsk_is_oom_victim(). The second one which allows the OOM victims to
> > deplete 100% of memory reserves wants some changes in order not to
> > block memory allocations by non OOM victims (e.g. GFP_ATOMIC allocations
> > by interrupt handlers, GFP_NOIO / GFP_NOFS allocations by subsystems
> > which are needed for making forward progress of threads in do_exit())
> > by consuming too much of memory reserves. The third one which blocks
> > oom_killer_disable() can be removed by replacing TIF_MEMDIE test in
> > exit_oom_victim() with PFA_OOM_WAITING test like below patch.
> 
> I plan to remove TIF_MEMDIE dependency for this as well but I would like
> to finish this pile first. We actually do not need any flag for that. We
> just need to detect last exiting thread and tsk_is_oom_victim. I have
> some preliminary code for that.

Please show me the preliminary code. How do you expedite termination of
exiting threads? If you simply remove test_thread_flag(TIF_MEMDIE) in
gfp_to_alloc_flags(), there is a risk of failing to escape from current
allocation request loop (if you also remove

	/* Avoid allocations with no watermarks from looping endlessly */
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;

) especially for CONFIG_MMU=n kernels, or hitting problems due to
allocation failure (if you don't remove

	/* Avoid allocations with no watermarks from looping endlessly */
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;

) due to not allowing access to memory reserves.

On the other hand, if you simply replace test_thread_flag(TIF_MEMDIE) in
gfp_to_alloc_flags() with signal->oom_mm != NULL, it might increase
possibility of depleting memory reserves.

> 
> > (If
> > oom_killer_disable() were specific to CONFIG_MMU=y kernels, I think
> > that not thawing OOM victims will be simpler because the OOM reaper
> > can reclaim memory without thawing OOM victims.)
> 
> Well I do not think keeping an oom victim inside the fridge is a good
> idea. The task might be not sitting on any reclaimable memory but it
> still might consume resources which are bound to its life time (open
> files and their buffers etc.).

Then, I do not think keeping !TIF_MEMDIE OOM victims (sharing TIF_MEMDIE
OOM victim's memory) inside the fridge is a good idea. There might be
resources (e.g. open files) which will not be released unless all threads
sharing TIF_MEMDIE OOM victim's memory terminate.

* Re: [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-08-01 10:46       ` Tetsuo Handa
@ 2016-08-01 11:33         ` Michal Hocko
  2016-08-02 10:32           ` Tetsuo Handa
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-08-01 11:33 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

On Mon 01-08-16 19:46:48, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 30-07-16 17:20:30, Tetsuo Handa wrote:
[...]
> > > But the disadvantage of this patch will
> > > be that we increase the possibility of depleting 100% of memory reserves
> > > by allowing them to allocate using ALLOC_NO_WATERMARKS after they left
> > > exit_mm().
> > 
> > I think this is a separate problem. As the current code stands we can
> > already deplete memory reserves. The large number of threads might be
> > sitting in an allocation loop before they bail out to handle the
> > SIGKILL. Exit path shouldn't add too much on top of that. If we want to
> > be reliable in not consuming all the reserves we would have to employ
> > some form of throttling and that is out of scope of this patch.
> 
> I'm suggesting such throttling by allowing fatal_signal_pending() or
> PF_EXITING threads access to only some portion of memory reserves.

Which is completely out of scope of this patch and this patch series as
well.

> > > It is possible that a user creates a process with 10000 threads
> > > and let that process be OOM-killed. Then, this patch allows 10000 threads
> > > to start consuming memory reserves after they left exit_mm(). OOM victims
> > > are not the only threads who need to allocate memory for termination. Non
> > > OOM victims might need to allocate memory at exit_task_work() in order to
> > > allow OOM victims to make forward progress.
> > 
> > this might be possible but unlike the regular exiting tasks we do
> > reclaim oom victim's memory in the background. So while they can consume
> > memory reserves we should also give some (and arguably much more) memory
> > back. The reserves are there to expedite the exit.
> 
> Background reclaim does not occur on CONFIG_MMU=n kernels. But this patch
> also affects CONFIG_MMU=n kernels. If a process with two threads was
> OOM-killed and one thread consumed too much memory after it left exit_mm()
> before the other thread sets MMF_OOM_SKIP on their mm by returning from
> exit_aio() etc. in __mmput() from mmput() from exit_mm(), this patch
> introduces a new possibility to OOM livelock. I think it is wild to assume
> that "CONFIG_MMU=n kernels can OOM livelock even without this patch. Thus,
> let's apply this patch even though this patch might break the balance of
> OOM handling in CONFIG_MMU=n kernels."

As I've said if you have strong doubts about the patch I can drop it for
now. I do agree that nommu really matters here, though.

> Also, where is the guarantee that memory reclaimed by the OOM reaper is
> used for terminating exiting threads?

As the oom reaper replenishes the memory the oom victim should be able
to consume that memory in favor of other processes exactly because it
has access to memory reserves which others do not.

> Since we do not prefer
> fatal_signal_pending() or PF_EXITING threads over !fatal_signal_pending()
> nor !PF_EXITING threads, it is possible that all memory reclaimed by the
> OOM reaper is depleted by !fatal_signal_pending() nor !PF_EXITING threads.
> Yes, the OOM reaper will allow the OOM killer to select next OOM victim.
> But the intention of this patch is to avoid calling out_of_memory() by
> allowing OOM victims access to memory reserves, isn't it?

Yes, and also to postpone oom_killer_disable to later.

> We after all need to call out_of_memory() regardless of whether we prefer
> TIF_MEMDIE (or signal->oom_mm != NULL) threads over !TIF_MEMDIE (or
> signal->oom_mm == NULL) threads.
> 
> So, it is not clear to me that this patch is an improvement.

It might fit in better with other TIF_MEMDIE related changes so I do not
care if it is dropped now.

> > > I think that allocations from
> > > do_exit() are important for terminating cleanly (from the point of view of
> > > filesystem integrity and kernel object management) and such allocations
> > > should not be given up simply because ALLOC_NO_WATERMARKS allocations
> > > failed.
> > 
> > We are talking about a fatal condition when OOM killer forcefully kills
> > a task. Chances are that the userspace leaves so much state behind that
> > a manual cleanup would be necessary anyway. Depleting the memory
> > reserves is not nice but I really believe that this particular patch
> > doesn't make the situation really much worse than before.
> 
> I'm not talking about inconsistency in userspace programs. I'm talking
> about inconsistency of objects managed by kernel (e.g. failing to drop
> references) caused by allocation failures.

That would be a bug on its own, no?
 
> > > > There is possible advantages of this because we are reducing chances
> > > > of further interference of the oom victim with the rest of the system
> > > > after oom_killer_disable(). Strictly speaking this is possible right
> > > > now because there are indeed allocations possible past exit_mm() and
> > > > who knows whether some of them can trigger IO. I haven't seen this in
> > > > practice though.
> > > 
> > > I don't know which I/O oom_killer_disable() must act as a hard barrier.
> > 
> > Any allocation that could trigger the IO can corrupt the hibernation
> > image or access the half suspended device. The whole point of
> > oom_killer_disable is to prevent anything like that to happen.
> 
> That's a puzzling answer. What I/O? If fs writeback done by a GFP_FS
> allocation issued by userspace processes between after returning from
> exit_mm() and reaching final schedule() in do_exit() is problematic, why
> fs writeback issued by a GFP_FS allocation done by kernel threads after
> returning from oom_killer_disable() is not problematic? I think any I/O
> which userspace processes can do is also doable by kernel threads.
> Where is the guarantee that kernel threads which do I/O which
> oom_killer_disable() acts as a hard barrier do not corrupt the hibernation
> image or access the half suspended device?

This has been discussed several times already. I do not have a link
handy but it should be there in the archives. I am sorry and do not want
to be rude here but I guess this is only tangential to this particular
patch series so I would prefer not to discuss PM consequences. PM
suspend is described as expecting no _user_ activity after oom_killer_disable.
If you are questioning that assumption then please do that in a separate
thread.

> > > But safer way is to get rid of TIF_MEMDIE's triple meanings. The first
> > > one which prevents the OOM killer from selecting next OOM victim was
> > > removed by replacing TIF_MEMDIE test in oom_scan_process_thread() with
> > > tsk_is_oom_victim(). The second one which allows the OOM victims to
> > > deplete 100% of memory reserves wants some changes in order not to
> > > block memory allocations by non OOM victims (e.g. GFP_ATOMIC allocations
> > > by interrupt handlers, GFP_NOIO / GFP_NOFS allocations by subsystems
> > > which are needed for making forward progress of threads in do_exit())
> > > by consuming too much of memory reserves. The third one which blocks
> > > oom_killer_disable() can be removed by replacing TIF_MEMDIE test in
> > > exit_oom_victim() with PFA_OOM_WAITING test like below patch.
> > 
> > I plan to remove TIF_MEMDIE dependency for this as well but I would like
> > to finish this pile first. We actually do not need any flag for that. We
> > just need to detect last exiting thread and tsk_is_oom_victim. I have
> > some preliminary code for that.
> 
> Please show me the preliminary code. How do you expedite termination of
> exiting threads? If you simply remove test_thread_flag(TIF_MEMDIE) in
> gfp_to_alloc_flags(), there is a risk of failing to escape from current
> allocation request loop (if you also remove
> 
> 	/* Avoid allocations with no watermarks from looping endlessly */
> 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> 		goto nopage;
> 
> ) especially for CONFIG_MMU=n kernels, or hitting problems due to
> allocation failure (if you don't remove
> 
> 	/* Avoid allocations with no watermarks from looping endlessly */
> 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> 		goto nopage;
> 
> ) due to not allowing access to memory reserves.
> 
> On the other hand, if you simply replace test_thread_flag(TIF_MEMDIE) in
> gfp_to_alloc_flags() with signal->oom_mm != NULL, it might increase
> possibility of depleting memory reserves.

My plan is to grant access to memory reserves in two stages.
A tsk_is_oom_victim task would get access to a portion of the memory
reserves (below try_harder) and full access after the oom reaper has
started reaping the memory. The code is not ready yet; I need to think
about it much more, and this is out of scope of this series.
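For illustration only, the two stages could look roughly like this
user-space model. The flag names mirror existing kernel identifiers
(ALLOC_HARDER, ALLOC_NO_WATERMARKS), but the staging logic itself is
just a sketch of the idea described above, not the preliminary code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's internal alloc flags. */
#define ALLOC_HARDER        0x01 /* stage 1: a portion of the reserves */
#define ALLOC_NO_WATERMARKS 0x02 /* stage 2: full access to reserves */

struct task {
	bool oom_victim;     /* tsk_is_oom_victim() */
	bool reaper_started; /* the oom reaper began reaping the mm */
};

/* Sketch of the two-stage reserve policy: victims get partial access,
 * and full access only once background reclaim is replenishing memory. */
static int oom_alloc_flags(const struct task *t)
{
	int flags = 0;

	if (t->oom_victim) {
		flags |= ALLOC_HARDER; /* allow going below the watermark */
		if (t->reaper_started)
			flags |= ALLOC_NO_WATERMARKS;
	}
	return flags;
}
```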

> > > (If
> > > oom_killer_disable() were specific to CONFIG_MMU=y kernels, I think
> > > that not thawing OOM victims will be simpler because the OOM reaper
> > > can reclaim memory without thawing OOM victims.)
> > 
> > Well I do not think keeping an oom victim inside the fridge is a good
> > idea. The task might be not sitting on any reclaimable memory but it
> > still might consume resources which are bound to its life time (open
> > files and their buffers etc.).
> 
> Then, I do not think keeping !TIF_MEMDIE OOM victims (sharing TIF_MEMDIE
> OOM victim's memory) inside the fridge is a good idea. There might be
> resources (e.g. open files) which will not be released unless all threads
> sharing TIF_MEMDIE OOM victim's memory terminate.

Yes, this is part of the future changes which will not care about
TIF_MEMDIE at all and will always act on whole process groups, and as
such (after this series) on all processes sharing the mm as well.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-08-01 11:33         ` Michal Hocko
@ 2016-08-02 10:32           ` Tetsuo Handa
  2016-08-02 11:31             ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2016-08-02 10:32 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

Michal Hocko wrote:
> > > > It is possible that a user creates a process with 10000 threads
> > > > and let that process be OOM-killed. Then, this patch allows 10000 threads
> > > > to start consuming memory reserves after they left exit_mm(). OOM victims
> > > > are not the only threads who need to allocate memory for termination. Non
> > > > OOM victims might need to allocate memory at exit_task_work() in order to
> > > > allow OOM victims to make forward progress.
> > > 
> > > this might be possible but unlike the regular exiting tasks we do
> > > reclaim oom victim's memory in the background. So while they can consume
> > > memory reserves we should also give some (and arguably much more) memory
> > > back. The reserves are there to expedite the exit.
> > 
> > Background reclaim does not occur on CONFIG_MMU=n kernels. But this patch
> > also affects CONFIG_MMU=n kernels. If a process with two threads was
> > OOM-killed and one thread consumed too much memory after it left exit_mm()
> > before the other thread sets MMF_OOM_SKIP on their mm by returning from
> > exit_aio() etc. in __mmput() from mmput() from exit_mm(), this patch
> > introduces a new possibility to OOM livelock. I think it is wild to assume
> > that "CONFIG_MMU=n kernels can OOM livelock even without this patch. Thus,
> > let's apply this patch even though this patch might break the balance of
> > OOM handling in CONFIG_MMU=n kernels."
> 
> As I've said if you have strong doubts about the patch I can drop it for
> now. I do agree that nommu really matters here, though.

OK. Then, for now let's postpone only oom_killer_disable() to later
rather than postponing exit_oom_victim() to later.

> 
> > Also, where is the guarantee that memory reclaimed by the OOM reaper is
> > used for terminating exiting threads?
> 
> As the oom reaper replenishes the memory the oom victim should be able
> to consume that memory in favor of other processes exactly because it
> has access to memory reserves which others are not.

The difficult thing is that TIF_MEMDIE (or signal->oom_mm != NULL) threads
are not always the only threads which need to allocate memory in order to
make forward progress. Some kernel threads might need to allocate memory
in order to allow TIF_MEMDIE (or signal->oom_mm != NULL) threads to make
forward progress, but they don't have access to memory reserves.

Anyway, this is out of scope if we for now postpone only
the oom_killer_disable() to later.

> > > > I think that allocations from
> > > > do_exit() are important for terminating cleanly (from the point of view of
> > > > filesystem integrity and kernel object management) and such allocations
> > > > should not be given up simply because ALLOC_NO_WATERMARKS allocations
> > > > failed.
> > > 
> > > We are talking about a fatal condition when OOM killer forcefully kills
> > > a task. Chances are that the userspace leaves so much state behind that
> > > a manual cleanup would be necessary anyway. Depleting the memory
> > > reserves is not nice but I really believe that this particular patch
> > > doesn't make the situation really much worse than before.
> > 
> > I'm not talking about inconsistency in userspace programs. I'm talking
> > about inconsistency of objects managed by kernel (e.g. failing to drop
> > references) caused by allocation failures.
> 
> That would be a bug on its own, no?

Right, but memory allocations after exit_mm() from do_exit() (e.g.
exit_task_work()) might assume (or depend on) the "too small to fail"
memory-allocation rule, where small GFP_FS allocations won't fail unless
TIF_MEMDIE is set, and this patch can unexpectedly break that rule for
such allocations.

>  
> > > > > There is possible advantages of this because we are reducing chances
> > > > > of further interference of the oom victim with the rest of the system
> > > > > after oom_killer_disable(). Strictly speaking this is possible right
> > > > > now because there are indeed allocations possible past exit_mm() and
> > > > > who knows whether some of them can trigger IO. I haven't seen this in
> > > > > practice though.
> > > > 
> > > > I don't know for which I/O oom_killer_disable() must act as a hard barrier.
> > > 
> > > Any allocation that could trigger the IO can corrupt the hibernation
> > > image or access the half suspended device. The whole point of
> > > oom_killer_disable is to prevent anything like that to happen.
> > 
> > That's a puzzling answer. What I/O? If fs writeback done by a GFP_FS
> > allocation issued by userspace processes between after returning from
> > exit_mm() and reaching final schedule() in do_exit() is problematic, why
> > fs writeback issued by a GFP_FS allocation done by kernel threads after
> > returning from oom_killer_disable() is not problematic? I think any I/O
> > which userspace processes can do is also doable by kernel threads.
> > Where is the guarantee that kernel threads which do I/O which
> > oom_killer_disable() acts as a hard barrier do not corrupt the hibernation
> > image or access the half suspended device?
> 
> This has been discussed several times already. I do not have a link
> handy but it should be there in the archives. I am sorry and do not want
> to be rude here but I guess this is only tangential to this particular
> patch series so I would prefer not discussing PM consequences. PM
> suspend is described to expect no _user_ activity after oom_killer_disable.
> If you are questioning that assumption then please do that in a separate
> thread.
> 

OK. Well, it seems to me that pm_restrict_gfp_mask() is called from
enter_state() in kernel/power/suspend.c which is really late stage of
suspend operation. This suggests that fs writeback done by a GFP_FS
allocation issued by kernel threads after returning from oom_killer_disable()
is not problematic. Though I don't know whether it is safe that we do not
have a synchronization mechanism to wait for non-freezable kernel threads
which are between post

  gfp_mask &= gfp_allowed_mask;

line in __alloc_pages_nodemask() and pre

  if (pm_suspended_storage())
      goto out;

line in __alloc_pages_may_oom() after pm_restrict_gfp_mask() is called.
Presumably we are assuming that we are no longer under memory pressure
so that no I/O will be needed for allocations by the moment
pm_restrict_gfp_mask() is called.
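For reference, the restriction itself is just a mask update; a user-space
model of pm_restrict_gfp_mask() clearing __GFP_IO/__GFP_FS out of
gfp_allowed_mask (bit values here are illustrative, not the kernel's
actual GFP layout) looks like this:

```c
#include <assert.h>

/* Illustrative bit values; the real GFP bit layout differs. */
#define __GFP_WAIT 0x01
#define __GFP_IO   0x02
#define __GFP_FS   0x04

static unsigned int gfp_allowed_mask = __GFP_WAIT | __GFP_IO | __GFP_FS;

/* Model of pm_restrict_gfp_mask(): during suspend, IO/FS are masked
 * out of every allocation, so no new writeback can be started. */
static void pm_restrict_gfp_mask_model(void)
{
	gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
}

/* Model of the "gfp_mask &= gfp_allowed_mask;" line quoted above. */
static unsigned int effective_gfp(unsigned int gfp_mask)
{
	return gfp_mask & gfp_allowed_mask;
}
```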


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/10] exit, oom: postpone exit_oom_victim to later
  2016-08-02 10:32           ` Tetsuo Handa
@ 2016-08-02 11:31             ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-08-02 11:31 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov

On Tue 02-08-16 19:32:45, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > > It is possible that a user creates a process with 10000 threads
> > > > > and let that process be OOM-killed. Then, this patch allows 10000 threads
> > > > > to start consuming memory reserves after they left exit_mm(). OOM victims
> > > > > are not the only threads who need to allocate memory for termination. Non
> > > > > OOM victims might need to allocate memory at exit_task_work() in order to
> > > > > allow OOM victims to make forward progress.
> > > > 
> > > > this might be possible but unlike the regular exiting tasks we do
> > > > reclaim oom victim's memory in the background. So while they can consume
> > > > memory reserves we should also give some (and arguably much more) memory
> > > > back. The reserves are there to expedite the exit.
> > > 
> > > Background reclaim does not occur on CONFIG_MMU=n kernels. But this patch
> > > also affects CONFIG_MMU=n kernels. If a process with two threads was
> > > OOM-killed and one thread consumed too much memory after it left exit_mm()
> > > before the other thread sets MMF_OOM_SKIP on their mm by returning from
> > > exit_aio() etc. in __mmput() from mmput() from exit_mm(), this patch
> > > introduces a new possibility to OOM livelock. I think it is wild to assume
> > > that "CONFIG_MMU=n kernels can OOM livelock even without this patch. Thus,
> > > let's apply this patch even though this patch might break the balance of
> > > OOM handling in CONFIG_MMU=n kernels."
> > 
> > As I've said if you have strong doubts about the patch I can drop it for
> > now. I do agree that nommu really matters here, though.
> 
> OK. Then, for now let's postpone only oom_killer_disable() to later
> rather than postponing exit_oom_victim() to later.

that would require other changes (basically making oom_killer_disable
independent of TIF_MEMDIE) which I think do not belong to this pile. So
I would rather sacrifice this patch instead; it will not be part of
the v2.
 
[...]
> > > > > I think that allocations from
> > > > > do_exit() are important for terminating cleanly (from the point of view of
> > > > > filesystem integrity and kernel object management) and such allocations
> > > > > should not be given up simply because ALLOC_NO_WATERMARKS allocations
> > > > > failed.
> > > > 
> > > > We are talking about a fatal condition when OOM killer forcefully kills
> > > > a task. Chances are that the userspace leaves so much state behind that
> > > > a manual cleanup would be necessary anyway. Depleting the memory
> > > > reserves is not nice but I really believe that this particular patch
> > > > doesn't make the situation really much worse than before.
> > > 
> > > I'm not talking about inconsistency in userspace programs. I'm talking
> > > about inconsistency of objects managed by kernel (e.g. failing to drop
> > > references) caused by allocation failures.
> > 
> > That would be a bug on its own, no?
> 
> Right, but memory allocations after exit_mm() from do_exit() (e.g.
> exit_task_work()) might assume (or depend on) the "too small to fail"
> memory-allocation rule, where small GFP_FS allocations won't fail unless
> TIF_MEMDIE is set, and this patch can unexpectedly break that rule for
> such allocations.

A silent dependency on nofail semantics without GFP_NOFAIL is still a bug.
Full stop. I really fail to see why you are still arguing about that.

[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-31  9:44             ` Michal Hocko
@ 2016-08-12  9:42               ` Michal Hocko
  2016-08-12 13:21                 ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-08-12  9:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

Sorry to bother you, Paul, but I would be really grateful if you could
comment on this, please!

On Sun 31-07-16 11:44:38, Michal Hocko wrote:
> On Fri 29-07-16 20:57:44, Michael S. Tsirkin wrote:
> > On Fri, Jul 29, 2016 at 03:35:29PM +0200, Michal Hocko wrote:
> > > On Fri 29-07-16 16:14:10, Michael S. Tsirkin wrote:
> > > > On Fri, Jul 29, 2016 at 08:04:22AM +0200, Michal Hocko wrote:
> > > > > On Thu 28-07-16 23:41:53, Michael S. Tsirkin wrote:
> > > > > > On Thu, Jul 28, 2016 at 09:42:33PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > > > and the reader would hit a page fault
> > > > > > > +	 * if it stumbled over a reaped memory.
> > > > > > 
> > > > > > This last point I don't get. flag read could bypass data read
> > > > > > if that happens data read could happen after unmap
> > > > > > yes it might get a PF but you handle that, correct?
> > > > > 
> > > > > The point I've tried to make is that if the reader really page faults
> > > > > then get_user will imply the full barrier already. If get_user didn't
> > > > > page fault then the state of the flag is not really important because
> > > > > the reaper shouldn't have touched it. Does it make more sense now or
> > > > > I've missed your question?
> > > > 
> > > > Can task flag read happen before the get_user pagefault?
> > > 
> > > Do you mean?
> > > 
> > > get_user_mm()
> > >   temp = false <- test_bit(MMF_UNSTABLE, &mm->flags)
> > >   ret = __get_user(x, ptr)
> > >   #PF
> > >   if (!ret && temp) # misses the flag
> > > 
> > > The code is basically doing
> > > 
> > >   if (!__get_user() && test_bit(MMF_UNSTABLE, &mm->flags))
> > > 
> > > so test_bit part of the conditional cannot be evaluated before
> > > __get_user() part is done. Compiler cannot reorder two depending
> > > subconditions AFAIK.
> > 
> > But maybe the CPU can.
> 
> Are you sure? How does that differ from
> 	if (ptr && ptr->something)
> construct?
> 
> Let's CC Paul. Just to describe the situation. We have the following
> situation:
> 
> #define __get_user_mm(mm, x, ptr)				\
> ({								\
> 	int ___gu_err = __get_user(x, ptr);			\
> 	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> 		___gu_err = -EFAULT;				\
> 	___gu_err;						\
> })
> 
> and the oom reaper doing:
> 
> 	set_bit(MMF_UNSTABLE, &mm->flags);
> 
> 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> 		unmap_page_range
> 
> I assume that write memory barrier between set_bit and unmap_page_range
> is not really needed because unmapping should already imply the memory
> barrier. A read memory barrier between __get_user and test_bit shouldn't
> be really needed because we can tolerate a stale value if __get_user
> didn't #PF because we haven't unmapped that address obviously. If we
> unmapped it then __get_user would #PF and that should imply a full
> memory barrier as well. Now the question is whether a CPU can speculate
> and read the flag before we issue the #PF.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-07-29 13:14       ` Michael S. Tsirkin
  2016-07-29 13:35         ` Michal Hocko
@ 2016-08-12  9:43         ` Michal Hocko
  1 sibling, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-08-12  9:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, Andrew Morton, Tetsuo Handa, Oleg Nesterov,
	David Rientjes, Vladimir Davydov

On Fri 29-07-16 16:14:10, Michael S. Tsirkin wrote:
> On Fri, Jul 29, 2016 at 08:04:22AM +0200, Michal Hocko wrote:
> > On Thu 28-07-16 23:41:53, Michael S. Tsirkin wrote:
[...]
> > > 
> > > > +	 */
> > > > +	set_bit(MMF_UNSTABLE, &mm->flags);
> > > > +
> > > 
> > > I would really prefer a callback that vhost would register
> > > and stop all accesses. Tell me if you need help on above idea.
> > 
> > 
> > Well, in order to make callback workable the oom reaper would have to
> > synchronize with the said callback until it declares all currently
> > ongoing accesses done. That means oom reaper would have to block/wait
> > and that is something I would really like to prevent from because it
> > just adds another possibility of the lockup (say the get_user cannot
> > make forward progress because it is stuck in the page fault allocating
> > memory). Or do you see any other way how to implement such a callback
> > mechanism without blocking on the oom_reaper side?
> 
> I'll think it over and respond.

ping?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12  9:42               ` Michal Hocko
@ 2016-08-12 13:21                 ` Oleg Nesterov
  2016-08-12 14:41                   ` Michal Hocko
                                     ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Oleg Nesterov @ 2016-08-12 13:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul E. McKenney, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On 08/12, Michal Hocko wrote:
>
> > Let's CC Paul. Just to describe the situation. We have the following
> > situation:
> >
> > #define __get_user_mm(mm, x, ptr)				\
> > ({								\
> > 	int ___gu_err = __get_user(x, ptr);			\
> > 	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> > 		___gu_err = -EFAULT;				\
> > 	___gu_err;						\
> > })
> >
> > and the oom reaper doing:
> >
> > 	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > 		unmap_page_range
> >
> > I assume that write memory barrier between set_bit and unmap_page_range
> > is not really needed because unmapping should already imply the memory
> > barrier.

Well, I leave this to Paul, but...

I think it is not needed because we can rely on pte locking. We do
not care if anything is re-ordered UNLESS __get_user() above actually
triggers a fault and re-populates the page which was already unmapped
by __oom_reap_task(), and in the latter case __get_user_mm() can't
miss MMF_UNSTABLE simply because __get_user() and unmap_page_range()
need to lock/unlock the same ptlock_ptr().

So we only need the compiler barrier to ensure that __get_user_mm()
won't read MMF_UNSTABLE before __get_user(). But since __get_user()
is a function, it is not needed either.
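To make the ordering argument concrete at the language level, here is a
user-space model of the check (the helper names and the -14/-EFAULT value
are illustrative stand-ins for __get_user()/test_bit(); note this only
demonstrates compiler-visible short-circuit ordering, while the CPU-level
ordering relies on the pte locking described above):

```c
#include <assert.h>
#include <stdbool.h>

static int flag_reads; /* counts evaluations of the flag test */

/* Stand-in for test_bit(MMF_UNSTABLE, &mm->flags). */
static bool mm_unstable(void)
{
	flag_reads++;
	return true; /* pretend the oom reaper set the flag */
}

/* Model of __get_user_mm(): gu_err != 0 means the fetch faulted.
 * Because && short-circuits, the flag test is only reached after a
 * successful fetch; it can never be observed "before" the fetch. */
static int get_user_mm_model(int gu_err)
{
	if (!gu_err && mm_unstable())
		return -14; /* -EFAULT: value may come from reaped memory */
	return gu_err;
}
```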

There is a more interesting case when another, third thread can trigger
a fault and populate this page before __get_user_mm() calls __get_user().
But even in this case I think we are fine.


What's really interesting is that I still fail to understand whether we
really need this hack; IIUC you are not sure either, and Michael didn't
bother to explain why a bogus zero from anon memory is worse than other
problems caused by SIGKILL from oom-kill.c.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 13:21                 ` Oleg Nesterov
@ 2016-08-12 14:41                   ` Michal Hocko
  2016-08-12 16:05                     ` Oleg Nesterov
  2016-08-12 15:57                   ` Paul E. McKenney
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-08-12 14:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On Fri 12-08-16 15:21:41, Oleg Nesterov wrote:
> On 08/12, Michal Hocko wrote:
[...]
> There is a more interesting case when another 3rd thread can trigger
> a fault and populate this page before __get_user_mm() calls __get_user().
> But even in this case I think we are fine.

All the threads should be killed/exiting, so they shouldn't access that
memory. My assumption is that the exit path doesn't touch that memory.
If any of the threads was in the middle of a page fault or g-u-p while
writing to that address then it should be OK, because it would be just
a matter of SIGKILL timing. I might be wrong here, and in that case
__get_user_mm wouldn't be sufficient, of course.

> Whats really interesting is that I still fail to understand do we really
> need this hack, iiuc you are not sure too, and Michael didn't bother to
> explain why a bogus zero from anon memory is worse than other problems
> caused by SIGKKILL from oom-kill.c.

Yes, I admit that I am not familiar with the vhost memory usage model so
I can only speculate. But the mere fact that the mm is bound to a device
fd which can be passed over to a different process makes me worried.
This means that the mm is basically isolated from the original process
until the last fd is closed which is under control of the process which
holds it. The mm can still be accessed during that time from the vhost
worker. And I guess this is exactly where the problem lies.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 13:21                 ` Oleg Nesterov
  2016-08-12 14:41                   ` Michal Hocko
@ 2016-08-12 15:57                   ` Paul E. McKenney
  2016-08-12 16:09                     ` Oleg Nesterov
  2016-08-12 16:23                     ` Michal Hocko
  2016-08-13  0:15                   ` Michael S. Tsirkin
  2016-08-22 13:03                   ` Michal Hocko
  3 siblings, 2 replies; 47+ messages in thread
From: Paul E. McKenney @ 2016-08-12 15:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On Fri, Aug 12, 2016 at 03:21:41PM +0200, Oleg Nesterov wrote:
> On 08/12, Michal Hocko wrote:
> >
> > > Let's CC Paul. Just to describe the situation. We have the following
> > > situation:
> > >
> > > #define __get_user_mm(mm, x, ptr)				\
> > > ({								\
> > > 	int ___gu_err = __get_user(x, ptr);			\
> > > 	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> > > 		___gu_err = -EFAULT;				\
> > > 	___gu_err;						\
> > > })
> > >
> > > and the oom reaper doing:
> > >
> > > 	set_bit(MMF_UNSTABLE, &mm->flags);
> > >
> > > 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > 		unmap_page_range
> > >
> > > I assume that write memory barrier between set_bit and unmap_page_range
> > > is not really needed because unmapping should already imply the memory
> > > barrier.
> 
> Well, I leave this to Paul, but...
> 
> I think it is not needed because we can rely on pte locking. We do
> not care if anything is re-ordered UNLESS __get_user() above actually
> triggers a fault and re-populates the page which was already unmapped
> by __oom_reap_task(), and in the latter case __get_user_mm() can't
> miss MMF_UNSTABLE simply because __get_user() and unmap_page_range()
> need to lock/unlock the same ptlock_ptr().
> 
> So we only need the compiler barrier to ensure that __get_user_mm()
> won't read MMF_UNSTABLE before __get_user(). But since __get_user()
> is a function, it is not needed either.
> 
> There is a more interesting case when another 3rd thread can trigger
> a fault and populate this page before __get_user_mm() calls __get_user().
> But even in this case I think we are fine.

Hmmm...  What source tree are you guys looking at?  I am seeing some
of the above being macros rather than functions and others not being
present at all...

							Thanx, Paul

> What's really interesting is that I still fail to understand whether we
> really need this hack; IIUC you are not sure either, and Michael didn't
> bother to explain why a bogus zero from anon memory is worse than other
> problems caused by SIGKILL from oom-kill.c.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 14:41                   ` Michal Hocko
@ 2016-08-12 16:05                     ` Oleg Nesterov
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Nesterov @ 2016-08-12 16:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul E. McKenney, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On 08/12, Michal Hocko wrote:
>
> On Fri 12-08-16 15:21:41, Oleg Nesterov wrote:
>
> > What's really interesting is that I still fail to understand whether we
> > really need this hack; IIUC you are not sure either, and Michael didn't
> > bother to explain why a bogus zero from anon memory is worse than other
> > problems caused by SIGKILL from oom-kill.c.
>
> Yes, I admit that I am not familiar with the vhost memory usage model so
> I can only speculate. But the mere fact that the mm is bound to a device
> fd

Yes, and I already tried to complain. This doesn't look right in any case.

> which can be passed over to a different process makes me worried.

> This means that the mm is basically isolated from the original process
> until the last fd is closed which is under control of the process which
> holds it. The mm can still be access during that time from the vhost
> worker. And I guess this is exactly where the problem lies.

Agreed.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 15:57                   ` Paul E. McKenney
@ 2016-08-12 16:09                     ` Oleg Nesterov
  2016-08-12 16:26                       ` Paul E. McKenney
  2016-08-12 16:23                     ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2016-08-12 16:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On 08/12, Paul E. McKenney wrote:
>
> Hmmm...  What source tree are you guys looking at?  I am seeing some
> of the above being macros rather than functions and others not being
> present at all...

Sorry for the confusion. These code snippets are from Michal's patches. I hope
he will write another email to unconfuse you ;)

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 15:57                   ` Paul E. McKenney
  2016-08-12 16:09                     ` Oleg Nesterov
@ 2016-08-12 16:23                     ` Michal Hocko
  1 sibling, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-08-12 16:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On Fri 12-08-16 08:57:34, Paul E. McKenney wrote:
> On Fri, Aug 12, 2016 at 03:21:41PM +0200, Oleg Nesterov wrote:
> > On 08/12, Michal Hocko wrote:
> > >
> > > > Let's CC Paul. Just to describe the situation. We have the following
> > > > situation:
> > > >
> > > > #define __get_user_mm(mm, x, ptr)				\
> > > > ({								\
> > > > 	int ___gu_err = __get_user(x, ptr);			\
> > > > 	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> > > > 		___gu_err = -EFAULT;				\
> > > > 	___gu_err;						\
> > > > })
> > > >
> > > > and the oom reaper doing:
> > > >
> > > > 	set_bit(MMF_UNSTABLE, &mm->flags);
> > > >
> > > > 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > > 		unmap_page_range
> > > >
> > > > I assume that write memory barrier between set_bit and unmap_page_range
> > > > is not really needed because unmapping should already imply the memory
> > > > barrier.
> > 
> > Well, I leave this to Paul, but...
> > 
> > I think it is not needed because we can rely on pte locking. We do
> > not care if anything is re-ordered UNLESS __get_user() above actually
> > triggers a fault and re-populates the page which was already unmapped
> > by __oom_reap_task(), and in the latter case __get_user_mm() can't
> > miss MMF_UNSTABLE simply because __get_user() and unmap_page_range()
> > need to lock/unlock the same ptlock_ptr().
> > 
> > So we only need the compiler barrier to ensure that __get_user_mm()
> > won't read MMF_UNSTABLE before __get_user(). But since __get_user()
> > is a function, it is not needed either.
> > 
> > There is a more interesting case when another 3rd thread can trigger
> > a fault and populate this page before __get_user_mm() calls __get_user().
> > But even in this case I think we are fine.
> 
> Hmmm...  What source tree are you guys looking at?  I am seeing some
> of the above being macros rather than functions and others not being
> present at all...

The code is not upstream yet. You can find the current version of the
patchset here:
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git#attempts/oom-robustify

namely we are talking about 3c24392768ab ("vhost, mm: make sure that
oom_reaper doesn't reap memory read by vhost")
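
[Editor's note: the check-after-read pattern of the __get_user_mm() macro
quoted at the top of this thread can be modeled in plain userspace C as
below. Everything here — struct mm_model, the bit number, the error value —
is an illustrative stand-in, not the kernel implementation.]

```c
/* Userspace model of the __get_user_mm pattern: do the read first,
 * then test the "unstable" bit, and report -EFAULT if the oom reaper
 * may have replaced the page with a zero page in the meantime. */

#define MMF_UNSTABLE_BIT 22
#define EFAULT_ERR       (-14)

struct mm_model {
	unsigned long flags;
	int page;		/* stands in for one page of user memory */
};

static int get_user_mm(struct mm_model *mm, int *x, const int *ptr)
{
	*x = *ptr;					/* the __get_user() step */
	if (mm->flags & (1UL << MMF_UNSTABLE_BIT))	/* post-read check */
		return EFAULT_ERR;
	return 0;
}

static void oom_reap(struct mm_model *mm)
{
	/* set_bit(MMF_UNSTABLE, ...) before any page is unmapped ... */
	mm->flags |= 1UL << MMF_UNSTABLE_BIT;
	/* ... then unmap_page_range(): the content is lost and a later
	 * refault would only produce a zero page. */
	mm->page = 0;
}
```

Because the flag is tested after the read, a reader can never both take
the refault and miss the flag — the ordering Oleg justifies via the pte
lock above.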

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 16:09                     ` Oleg Nesterov
@ 2016-08-12 16:26                       ` Paul E. McKenney
  0 siblings, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2016-08-12 16:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On Fri, Aug 12, 2016 at 06:09:30PM +0200, Oleg Nesterov wrote:
> On 08/12, Paul E. McKenney wrote:
> >
> > Hmmm...  What source tree are you guys looking at?  I am seeing some
> > of the above being macros rather than functions and others not being
> > present at all...
> 
> Sorry for the confusion. These code snippets are from Michal's patches. I hope
> he will write another email to unconfuse you ;)

Actually, I suspect that you have it as well in hand as I would have.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 13:21                 ` Oleg Nesterov
  2016-08-12 14:41                   ` Michal Hocko
  2016-08-12 15:57                   ` Paul E. McKenney
@ 2016-08-13  0:15                   ` Michael S. Tsirkin
  2016-08-14  8:41                     ` Michal Hocko
  2016-08-22 13:03                   ` Michal Hocko
  3 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-08-13  0:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Fri, Aug 12, 2016 at 03:21:41PM +0200, Oleg Nesterov wrote:
> What's really interesting is that I still fail to understand whether we
> really need this hack; iiuc you are not sure either, and Michael didn't
> bother to explain why a bogus zero from anon memory is worse than other
> problems caused by SIGKILL from oom-kill.c.

The vhost thread will die, but the vcpu thread keeps going.  If its
memory is corrupted because vhost reads a 0 and uses it as an array
index, it can do things like corrupt the disk, so it can't be restarted.

But I really wish we didn't need this special-casing.  Can't PTEs be
made invalid on oom instead of pointing them at the zero page? And then
won't memory accesses trigger pagefaults instead of returning 0? That
would make regular copy_from_user machinery do the right thing,
making vhost stop running as appropriate.
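
[Editor's note: the error handling Michael refers to — stop on a failed
copy instead of consuming a bogus zero — can be sketched in userspace as
below. copy_from_user_model() and the ring layout are made-up names for
illustration; real vhost uses copy_from_user() on the vring.]

```c
/* A worker loop that stops on a failed copy instead of indexing a
 * descriptor table with garbage. */

#define EFAULT_ERR (-14)

static long copy_from_user_model(int *dst, const int *src, int faulting)
{
	if (faulting)
		return EFAULT_ERR;	/* what a faulting copy_from_user yields */
	*dst = *src;
	return 0;
}

/* Process descriptors until a copy fails; returns how many were handled. */
static int worker_loop(const int *ring, int n, int fault_at)
{
	int i, head;

	for (i = 0; i < n; i++) {
		if (copy_from_user_model(&head, &ring[i], i == fault_at) < 0)
			break;	/* error path: stop, never use garbage as an index */
		(void)head;	/* head would index a descriptor table here */
	}
	return i;
}
```

The point of the thread: a zero page makes the copy *succeed* with a bogus
value, so the error path above is never taken — which is why an explicit
EFAULT on reaped memory is wanted.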

-- 
MST


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-13  0:15                   ` Michael S. Tsirkin
@ 2016-08-14  8:41                     ` Michal Hocko
  2016-08-14 16:57                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-08-14  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Sat 13-08-16 03:15:00, Michael S. Tsirkin wrote:
> On Fri, Aug 12, 2016 at 03:21:41PM +0200, Oleg Nesterov wrote:
> > What's really interesting is that I still fail to understand whether we
> > really need this hack; iiuc you are not sure either, and Michael didn't
> > bother to explain why a bogus zero from anon memory is worse than other
> > problems caused by SIGKILL from oom-kill.c.
> 
> > The vhost thread will die, but the vcpu thread keeps going.  If its
> > memory is corrupted because vhost reads a 0 and uses it as an array
> > index, it can do things like corrupt the disk, so it can't be restarted.
> 
> But I really wish we didn't need this special-casing.  Can't PTEs be
> made invalid on oom instead of pointing them at the zero page?

Well ptes are just made !present and the subsequent #PF will allocate
a fresh new page which will be a zero page as the original content is
gone already. But I am not really sure what you mean by an invalid
pte. You are in a kernel thread context, aka unkillable context. How
would you handle SIGBUS or whatever other signal as a result of the
invalid access?

> And then
> won't memory accesses trigger pagefaults instead of returning 0?

See above. Zero page is just result of the lost memory content. We
cannot both reclaim and keep the original content.

> That
> would make regular copy_from_user machinery do the right thing,
> making vhost stop running as appropriate.

I must be missing something here, but how would you make the kernel
thread context find out about the invalid access? You would have to
perform a signal handling routine after every single memory access, and
I fail to see how this is any different from a special copy_from_user_mm.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-14  8:41                     ` Michal Hocko
@ 2016-08-14 16:57                       ` Michael S. Tsirkin
  2016-08-14 23:06                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-08-14 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Sun, Aug 14, 2016 at 10:41:52AM +0200, Michal Hocko wrote:
> On Sat 13-08-16 03:15:00, Michael S. Tsirkin wrote:
> > On Fri, Aug 12, 2016 at 03:21:41PM +0200, Oleg Nesterov wrote:
> > > What's really interesting is that I still fail to understand whether we
> > > really need this hack; iiuc you are not sure either, and Michael didn't
> > > bother to explain why a bogus zero from anon memory is worse than other
> > > problems caused by SIGKILL from oom-kill.c.
> > 
> > > The vhost thread will die, but the vcpu thread keeps going.  If its
> > > memory is corrupted because vhost reads a 0 and uses it as an array
> > > index, it can do things like corrupt the disk, so it can't be restarted.
> > 
> > But I really wish we didn't need this special-casing.  Can't PTEs be
> > made invalid on oom instead of pointing them at the zero page?
> 
> Well ptes are just made !present and the subsequent #PF will allocate
> a fresh new page which will be a zero page as the original content is
> gone already.

Can't we set a flag to make fixups desist from faulting
in memory?


> But I am not really sure what you mean by an invalid
> pte. You are in a kernel thread context, aka unkillable context. How
> would you handle SIGBUS or whatever other signal as a result of the
> invalid access?

No need for a signal - each copy-from-user access is already
checked for errors.

> > And then
> > won't memory accesses trigger pagefaults instead of returning 0?
> 
> See above. Zero page is just result of the lost memory content. We
> cannot both reclaim and keep the original content.

Isn't this the code that decides it's a valid address, so
we need to bring in a page (in __do_page_fault)?


        vma = find_vma(mm, address);
        if (unlikely(!vma)) {
                bad_area(regs, error_code, address);
                return;
        }       
        if (likely(vma->vm_start <= address))
                goto good_area;
        if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
                bad_area(regs, error_code, address);
                return;
        }       


So why can't we check a flag here, and call bad_area?
Then vhost will get an error from accessing userspace
memory and can handle it correctly.


> > That
> > would make regular copy_from_user machinery do the right thing,
> > making vhost stop running as appropriate.
> 
> I must be missing something here, but how would you make the kernel
> thread context find out about the invalid access? You would have to
> perform a signal handling routine after every single memory access, and
> I fail to see how this is any different from a special copy_from_user_mm.

No, because IIUC no checks are needed as long as there
is no fault. On a fault, fixups are run; at the moment
they bring in a page, but I am saying they should
behave as if an invalid address had been accessed instead.


> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-14 16:57                       ` Michael S. Tsirkin
@ 2016-08-14 23:06                         ` Michael S. Tsirkin
  2016-08-15  9:49                           ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-08-14 23:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Sun, Aug 14, 2016 at 07:57:20PM +0300, Michael S. Tsirkin wrote:
> On Sun, Aug 14, 2016 at 10:41:52AM +0200, Michal Hocko wrote:
> > On Sat 13-08-16 03:15:00, Michael S. Tsirkin wrote:
> > > On Fri, Aug 12, 2016 at 03:21:41PM +0200, Oleg Nesterov wrote:
> > > > What's really interesting is that I still fail to understand whether we
> > > > really need this hack; iiuc you are not sure either, and Michael didn't
> > > > bother to explain why a bogus zero from anon memory is worse than other
> > > > problems caused by SIGKILL from oom-kill.c.
> > > 
> > > > The vhost thread will die, but the vcpu thread keeps going.  If its
> > > > memory is corrupted because vhost reads a 0 and uses it as an array
> > > > index, it can do things like corrupt the disk, so it can't be restarted.
> > > 
> > > But I really wish we didn't need this special-casing.  Can't PTEs be
> > > made invalid on oom instead of pointing them at the zero page?
> > 
> > Well ptes are just made !present and the subsequent #PF will allocate
> > a fresh new page which will be a zero page as the original content is
> > gone already.
> 
> Can't we set a flag to make fixups desist from faulting
> in memory?
> 
> 
> > But I am not really sure what you mean by an invalid
> > pte. You are in a kernel thread context, aka unkillable context. How
> > would you handle SIGBUS or whatever other signal as a result of the
> > invalid access?
> 
> No need for a signal - each copy-from-user access is already
> checked for errors.
> 
> > > And then
> > > won't memory accesses trigger pagefaults instead of returning 0?
> > 
> > See above. Zero page is just result of the lost memory content. We
> > cannot both reclaim and keep the original content.
> 
> Isn't this the code that decides it's a valid address, so
> we need to bring in a page (in __do_page_fault)?
> 
> 
>         vma = find_vma(mm, address);
>         if (unlikely(!vma)) {
>                 bad_area(regs, error_code, address);
>                 return;
>         }       
>         if (likely(vma->vm_start <= address))
>                 goto good_area;
>         if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
>                 bad_area(regs, error_code, address);
>                 return;
>         }       
> 
> 
> So why can't we check a flag here, and call bad_area?
> Then vhost will get an error from accessing userspace
> memory and can handle it correctly.
> 
> 
> > > That
> > > would make regular copy_from_user machinery do the right thing,
> > > making vhost stop running as appropriate.
> > 
> > I must be missing something here, but how would you make the kernel
> > thread context find out about the invalid access? You would have to
> > perform a signal handling routine after every single memory access, and
> > I fail to see how this is any different from a special copy_from_user_mm.
> 
> No, because IIUC no checks are needed as long as there
> is no fault. On a fault, fixups are run; at the moment
> they bring in a page, but I am saying they should
> behave as if an invalid address had been accessed instead.
> 
> 
> > -- 
> > Michal Hocko
> > SUSE Labs


So fundamentally, won't the following make copy to/from user
return EFAULT?  If yes, vhost is already prepared to handle that.


diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index dc80230..e5dbee5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1309,6 +1309,11 @@ retry:
 		might_sleep();
 	}
 
+	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags))) {
+		bad_area(regs, error_code, address);
+		return;
+	}
+
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {
 		bad_area(regs, error_code, address);


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-14 23:06                         ` Michael S. Tsirkin
@ 2016-08-15  9:49                           ` Michal Hocko
  2016-08-17 16:58                             ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-08-15  9:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Mon 15-08-16 02:06:31, Michael S. Tsirkin wrote:
[...]
> So fundamentally, won't the following make copy to/from user
> return EFAULT?  If yes, vhost is already prepared to handle that.
> 
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index dc80230..e5dbee5 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1309,6 +1309,11 @@ retry:
>  		might_sleep();
>  	}
>  
> +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags))) {
> +		bad_area(regs, error_code, address);
> +		return;
> +	}
> +
>  	vma = find_vma(mm, address);
>  	if (unlikely(!vma)) {
>  		bad_area(regs, error_code, address);

This would be racy. But even if we did the check _after_ the #PF is
handled, I am not very happy to touch the #PF path, which is quite hot,
for something as rare as OOM and which has only one user needing special
handling. That is the primary reason why I prefer the specific API.
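
[Editor's note: the race can be shown with a small sequential model. If
the fault path tests MMF_UNSTABLE *before* servicing the fault, the
reaper can set the flag in the window after the check and a bogus zero
is returned with no error; testing *after* the fault closes the window.
All names here are illustrative, not the kernel code.]

```c
/* Model of check-before-fault (racy) vs. fault-then-check. */

#define EFAULT_ERR (-14)

struct mm_model {
	int unstable;		/* stands in for MMF_UNSTABLE */
	int page_present;
	int page;
};

static void oom_reap(struct mm_model *mm)
{
	mm->unstable = 1;
	mm->page_present = 0;	/* content lost; a refault gives zero */
	mm->page = 0;
}

/* Racy variant: check first, then service the fault. */
static int fault_check_first(struct mm_model *mm, int *val, int reap_in_window)
{
	if (mm->unstable)
		return EFAULT_ERR;
	if (reap_in_window)
		oom_reap(mm);		/* reaper sneaks in after the check */
	if (!mm->page_present)
		mm->page_present = 1;	/* refault maps a zero page */
	*val = mm->page;
	return 0;			/* bogus zero slips through */
}

/* Check-after-#PF variant: service the fault, then check. */
static int fault_check_last(struct mm_model *mm, int *val, int reap_in_window)
{
	if (reap_in_window)
		oom_reap(mm);
	if (!mm->page_present)
		mm->page_present = 1;
	*val = mm->page;
	if (mm->unstable)
		return EFAULT_ERR;	/* caller sees the error */
	return 0;
}
```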

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-15  9:49                           ` Michal Hocko
@ 2016-08-17 16:58                             ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-08-17 16:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Mon 15-08-16 11:49:12, Michal Hocko wrote:
> On Mon 15-08-16 02:06:31, Michael S. Tsirkin wrote:
> [...]
> > So fundamentally, won't the following make copy to/from user
> > return EFAULT?  If yes, vhost is already prepared to handle that.
> > 
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index dc80230..e5dbee5 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1309,6 +1309,11 @@ retry:
> >  		might_sleep();
> >  	}
> >  
> > +	if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags))) {
> > +		bad_area(regs, error_code, address);
> > +		return;
> > +	}
> > +
> >  	vma = find_vma(mm, address);
> >  	if (unlikely(!vma)) {
> >  		bad_area(regs, error_code, address);
> 
> This would be racy. But even if we did the check _after_ the #PF is
> handled, I am not very happy to touch the #PF path, which is quite hot,
> for something as rare as OOM and which has only one user needing special
> handling. That is the primary reason why I prefer the specific API.

I would really appreciate it if we could reach some conclusion here. I
would like to target the upcoming merge window. I do not insist on the
approach I have taken but I feel it is the least disruptive wrt. the
usecase. If there is a strong opposition and a general agreement that
hooking into the page fault handler is a better way to go I can
implement that. But please consider the amount of work and the fact that
it is only vhost which really matters here.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-12 13:21                 ` Oleg Nesterov
                                     ` (2 preceding siblings ...)
  2016-08-13  0:15                   ` Michael S. Tsirkin
@ 2016-08-22 13:03                   ` Michal Hocko
  2016-08-22 21:01                     ` Michael S. Tsirkin
  3 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-08-22 13:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, linux-mm, Andrew Morton, Tetsuo Handa,
	David Rientjes, Vladimir Davydov, Michael S. Tsirkin

On Fri 12-08-16 15:21:41, Oleg Nesterov wrote:
[...]
> What's really interesting is that I still fail to understand whether we
> really need this hack; iiuc you are not sure either, and Michael didn't
> bother to explain why a bogus zero from anon memory is worse than other
> problems caused by SIGKILL from oom-kill.c.

OK, so I've extended the changelog to clarify this some more, hopefully.
"
vhost driver relies on copy_from_user/get_user from a kernel thread.
This makes it impossible to reap the memory of an oom victim which
shares the mm with the vhost kernel thread because it could see a zero
page unexpectedly and theoretically make an incorrect decision visible
outside of the killed task context. To quote Michael S. Tsirkin:
: Getting an error from __get_user and friends is handled gracefully.
: Getting zero instead of a real value will cause userspace
: memory corruption.

The vhost kernel thread is bound to an open fd of the vhost device which
is not tied to the mm owner life cycle in theory. The fd can be
inherited or passed over to another process which means that we really
have to be careful about unexpected memory corruption because unlike for
normal oom victims the result will be visible outside of the oom victim
context.

Make sure that each place which can read from userspace is annotated
properly and it uses copy_from_user_mm, __get_user_mm resp.
copy_from_iter_mm. Each will get the target mm as an argument and it
performs a pessimistic check to rule out that the oom_reaper could
possibly unmap the particular page. __oom_reap_task then just needs to
mark the mm as unstable before it unmaps any page.

An alternative approach would require hooking into the page fault path
and triggering the EFAULT path from there, but I do not want to add code
for all users while there is a single use_mm() consumer which suffers
from this problem.

This is a preparatory patch without any functional changes because
the oom reaper doesn't touch mm shared with kthreads yet.
"

Does it help? Are there any other concerns or I can repost the series
and ask Andrew to pick it for mmotm tree?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-22 13:03                   ` Michal Hocko
@ 2016-08-22 21:01                     ` Michael S. Tsirkin
  2016-08-23  7:55                       ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-08-22 21:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Mon, Aug 22, 2016 at 03:03:11PM +0200, Michal Hocko wrote:
> On Fri 12-08-16 15:21:41, Oleg Nesterov wrote:
> [...]
> > What's really interesting is that I still fail to understand whether we
> > really need this hack; iiuc you are not sure either, and Michael didn't
> > bother to explain why a bogus zero from anon memory is worse than other
> > problems caused by SIGKILL from oom-kill.c.
> 
> OK, so I've extended the changelog to clarify this some more, hopefully.
> "
> vhost driver relies on copy_from_user/get_user from a kernel thread.
> This makes it impossible to reap the memory of an oom victim which
> shares the mm with the vhost kernel thread because it could see a zero
> page unexpectedly and theoretically make an incorrect decision visible
> outside of the killed task context. To quote Michael S. Tsirkin:
> : Getting an error from __get_user and friends is handled gracefully.
> : Getting zero instead of a real value will cause userspace
> : memory corruption.
> 
> The vhost kernel thread is bound to an open fd of the vhost device which
> is not tied to the mm owner life cycle in theory. The fd can be
> inherited or passed over to another process which means that we really
> have to be careful about unexpected memory corruption because unlike for
> normal oom victims the result will be visible outside of the oom victim
> context.
> 
> Make sure that each place which can read from userspace is annotated
> properly and it uses copy_from_user_mm, __get_user_mm resp.
> copy_from_iter_mm. Each will get the target mm as an argument and it
> performs a pessimistic check to rule out that the oom_reaper could
> possibly unmap the particular page. __oom_reap_task then just needs to
> mark the mm as unstable before it unmaps any page.
> 
> An alternative approach would require hooking into the page fault path
> and triggering the EFAULT path from there, but I do not want to add code
> for all users while there is a single use_mm() consumer which suffers
> from this problem.

However, you are adding code on the data path while page fault
handling is a slow path. It's a single user from the kernel's
perspective, but for someone who is running virt workloads
this could be 100% of the uses.

We did switch to __copy_from ... callbacks in the past,
and that did help performance; the extra branches
there have more or less the same cost.

And the resulting API is fragile, to say the least.


> This is a preparatory patch without any functional changes because
> the oom reaper doesn't touch mm shared with kthreads yet.
> "
> 
> Does it help? Are there any other concerns or I can repost the series
> and ask Andrew to pick it for mmotm tree?

Actually, vhost net calls out to tun which does regular copy_from_iter.
Returning 0 there will cause corrupted packets in the network: not a
huge deal, but ugly.  And I don't think we want to annotate tun and
macvtap as well.

Really please just fix it in the page fault code like Oleg suggested.
It's a couple of lines of code and all current and
future users are automatically fixed.


> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-22 21:01                     ` Michael S. Tsirkin
@ 2016-08-23  7:55                       ` Michal Hocko
  2016-08-23  9:06                         ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2016-08-23  7:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Tue 23-08-16 00:01:23, Michael S. Tsirkin wrote:
[...]
> Actually, vhost net calls out to tun which does regular copy_from_iter.
> Returning 0 there will cause corrupted packets in the network: not a
> huge deal, but ugly.  And I don't think we want to annotate tun and
> macvtap as well.

Hmm, OK, I wasn't aware of that path and being consistent here matters.
If the vhost driver can interact with other subsystems then there is
really no other option than hooking into the page fault path. Ohh well.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-23  7:55                       ` Michal Hocko
@ 2016-08-23  9:06                         ` Michal Hocko
  2016-08-23 12:54                           ` Michael S. Tsirkin
  2016-08-24 16:42                           ` Michal Hocko
  0 siblings, 2 replies; 47+ messages in thread
From: Michal Hocko @ 2016-08-23  9:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Tue 23-08-16 09:55:55, Michal Hocko wrote:
> On Tue 23-08-16 00:01:23, Michael S. Tsirkin wrote:
> [...]
> > Actually, vhost net calls out to tun which does regular copy_from_iter.
> > Returning 0 there will cause corrupted packets in the network: not a
> > huge deal, but ugly.  And I don't think we want to annotate tun and
> > macvtap as well.
> 
> Hmm, OK, I wasn't aware of that path and being consistent here matters.
> If the vhost driver can interact with other subsystems then there is
> really no other option than hooking into the page fault path. Ohh well.

Here is a completely untested patch, just for a sanity check.
---

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-23  9:06                         ` Michal Hocko
@ 2016-08-23 12:54                           ` Michael S. Tsirkin
  2016-08-24 16:42                           ` Michal Hocko
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2016-08-23 12:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

On Tue, Aug 23, 2016 at 11:06:55AM +0200, Michal Hocko wrote:
> On Tue 23-08-16 09:55:55, Michal Hocko wrote:
> > On Tue 23-08-16 00:01:23, Michael S. Tsirkin wrote:
> > [...]
> > > Actually, vhost net calls out to tun which does regular copy_from_iter.
> > > Returning 0 there will cause corrupted packets in the network: not a
> > > huge deal, but ugly.  And I don't think we want to annotate tun and
> > > macvtap as well.
> > 
> > Hmm, OK, I wasn't aware of that path and being consistent here matters.
> > If the vhost driver can interact with other subsystems then there is
> > really no other option than hooking into the page fault path. Ohh well.
> 
> Here is a completely untested patch, just for a sanity check.
> ---
> >From f32711ea518f8151d6efb1c71f359211117dd5a2 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 14 Jun 2016 09:33:06 +0200
> Subject: [PATCH] vhost, mm: make sure that oom_reaper doesn't reap memory read
>  by vhost
> 
> vhost driver relies on copy_from_user/get_user from a kernel thread.
> This makes it impossible to reap the memory of an oom victim which
> shares the mm with the vhost kernel thread because it could see a zero
> page unexpectedly and theoretically make an incorrect decision visible
> outside of the killed task context. To quote Michael S. Tsirkin:
> : Getting an error from __get_user and friends is handled gracefully.
> : Getting zero instead of a real value will cause userspace
> : memory corruption.
> 
> The vhost kernel thread is bound to an open fd of the vhost device which
> is not tied to the mm owner life cycle in theory. The fd can be
> inherited or passed over to another process which means that we really
> have to be careful about unexpected memory corruption because unlike for
> normal oom victims the result will be visible outside of the oom victim
> context.
> 
> Make sure that no kthread context (users of use_mm) can ever see
> corrupted data because of the oom reaper and hook into the page fault
> path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the
> flag before it starts unmapping the address space while the flag is
> checked after the page fault has been handled. If the flag is set
> then SIGBUS is triggered so any g-u-p user will get an error code.
> 
> This patch shouldn't have any visible effect at this moment because the
> OOM killer doesn't invoke oom reaper for tasks with mm shared with
> kthreads yet.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/sched.h |  1 +
>  mm/memory.c           | 13 +++++++++++++
>  mm/oom_kill.c         |  8 ++++++++
>  3 files changed, 22 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index eda579f3283a..63acaf9cc51c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -522,6 +522,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
> +#define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 83be99d9d8a1..5c1df34fef64 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3656,6 +3656,19 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                          mem_cgroup_oom_synchronize(false);
>  	}
>  
> +	/*
> +	 * This mm has already been reaped by the oom reaper and so the
> +	 * refault cannot be trusted in general. Anonymous refaults would
> +	 * lose data and give a zero page instead, e.g. This is especially
> +	 * a problem for use_mm() because regular tasks will just die and
> +	 * the corrupted data will not be visible anywhere while a kthread
> +	 * will outlive the oom victim and potentially propagate the data
> +	 * further.
> +	 */
> +	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> +				&& test_bit(MMF_UNSTABLE, &mm->flags)))
> +		ret = VM_FAULT_SIGBUS;
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3b990544db6d..5a3ba96c8338 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -495,6 +495,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  		goto unlock_oom;
>  	}
>  
> +	/*
> +	 * Tell all users of get_user/copy_from_user etc... that the content
> +	 * is no longer stable. No barriers are really needed because
> +	 * unmapping should imply barriers already and the reader would hit
> +	 * a page fault if it stumbled over reaped memory.
> +	 */
> +	set_bit(MMF_UNSTABLE, &mm->flags);
> +
>  	tlb_gather_mmu(&tlb, mm, 0, -1);
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
>  		if (is_vm_hugetlb_page(vma))
> -- 
> 2.8.1

That's much better IMHO, and it's also much clearer why there's
no need for barriers here.

Acked-by: Michael S. Tsirkin <mst@redhat.com>



> -- 
> Michal Hocko
> SUSE Labs



* Re: [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
  2016-08-23  9:06                         ` Michal Hocko
  2016-08-23 12:54                           ` Michael S. Tsirkin
@ 2016-08-24 16:42                           ` Michal Hocko
  1 sibling, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2016-08-24 16:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Paul E. McKenney, linux-mm, Andrew Morton,
	Tetsuo Handa, David Rientjes, Vladimir Davydov

[-- Attachment #1: Type: text/plain, Size: 2046 bytes --]

On Tue 23-08-16 11:06:55, Michal Hocko wrote:
> On Tue 23-08-16 09:55:55, Michal Hocko wrote:
> > On Tue 23-08-16 00:01:23, Michael S. Tsirkin wrote:
> > [...]
> > > Actually, vhost net calls out to tun which does regular copy_from_iter.
> > > Returning 0 there will cause corrupted packets in the network: not a
> > > huge deal, but ugly.  And I don't think we want to annotate run and
> > > macvtap as well.
> > 
> > Hmm, OK, I wasn't aware of that path and being consistent here matters.
> > If the vhost driver can interact with other subsystems then there is
> > really no other option than hooking into the page fault path. Ohh well.
> 
> Here is a completely untested patch just for sanity check.

OK, so I've tested the patch and it seems to work as expected. I didn't
know how to configure vhost so I've just hacked up a quick kernel thread
which picks a task (I always select the first task whose oom_score_adj is
OOM_SCORE_ADJ_MAX because that would be the selected victim - see the
code attached) and then reads through its address space in a loop. The
oom victim then just mmaps and populates a private anon mapping which
causes the oom killer to fire. The kernel thread properly notices that
the memory could have been reaped.
[  628.559374] Out of memory: Kill process 3193 (oom_victim) score 1868 or sacrifice child
[  628.561713] Killed process 3193 (oom_victim) total-vm:1052540kB, anon-rss:782964kB, file-rss:12kB, shmem-rss:0kB
[  628.568120] Found a dead vma: ret:-14vma ffff88003c697000 start 00007f9227b33000 end 00007f9267b33000
next ffff88003d80d8a0 prev ffff88003d80d228 mm ffff88003d97b200
prot 8000000000000025 anon_vma ffff88003dbbc000 vm_ops           (null)
pgoff 7f9227b33 file           (null) private_data           (null)
flags: 0x100073(read|write|mayread|maywrite|mayexec|account)
[  628.595684] oom_reaper: reaped process 3193 (oom_victim), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  673.366282] Waiting for kthread to stop
[  673.367487] Done

Are there any objections or suggestions to the approach?
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: oom_reaper_test.c --]
[-- Type: text/x-csrc, Size: 2103 bytes --]

#include <linux/module.h>
#include <linux/init.h>
#include <linux/printk.h>
#include <linux/sched.h>
#include <linux/mmu_context.h>
#include <linux/mm_types.h>
#include <linux/mm.h>
#include <linux/kthread.h>
#include <linux/oom.h>

struct task_struct *th = NULL;

static int read_vma(struct vm_area_struct *vma)
{
	unsigned long addr;
	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		char __user *ptr = (char __user *)addr;
		int ret;
		char c;

		if ((ret = get_user(c, ptr)) < 0 && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
			pr_info("Found a dead vma: ret:%d\n", ret);
			dump_vma(vma);
			return 1;
		}
	}
	return 0;
}

static int scan_memory(void *data)
{
	struct task_struct *victim = data;
	struct mm_struct *mm;
	int reported = 0;

	pr_info("Starting with pid:%d %s\n", victim->pid, victim->comm);
	mm = get_task_mm(victim);
	/* the reference taken in mymodule_init is no longer needed */
	put_task_struct(victim);
	if (!mm) {
		pr_info("Failed to get mm\n");
		return 1;
	}
	use_mm(mm);
	mmput(mm);

	while (!kthread_should_stop()) {
		struct vm_area_struct *vma = mm->mmap;
		
		if (!reported && test_bit(MMF_UNSTABLE, &mm->flags)) {
			pr_info("mm is unstable\n");
			reported = 1;
		}
		for (; vma; vma = vma->vm_next) {
			if (!(vma->vm_flags & VM_MAYREAD))
				continue;
			if (read_vma(vma))
				goto out;
		}
		schedule_timeout_idle(HZ);
	}
out:
	unuse_mm(mm);
	while (!kthread_should_stop()) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (kthread_should_stop())
			break;
		schedule();
	}
	return 0;
}
static int __init mymodule_init(void)
{
	struct task_struct *p, *victim = NULL;

	rcu_read_lock();
	for_each_process(p) {
		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MAX) {
			get_task_struct(p);
			victim = p;
			break;
		}
	}
	rcu_read_unlock();
	if (!victim) {
		pr_info("No potential victim found\n");
		return 1;
	}

	th = kthread_run(scan_memory, victim, "scan_memory");
	return 0;
}

static void __exit mymodule_exit(void)
{
	if (!th)
		return;

	pr_info("Waiting for kthread to stop\n");
	/*
	 * No extra reference was taken on th and scan_memory spins until
	 * kthread_should_stop, so kthread_stop alone is sufficient here.
	 */
	kthread_stop(th);
	pr_info("Done\n");
}

module_init(mymodule_init);
module_exit(mymodule_exit);

MODULE_LICENSE("GPL");


end of thread, other threads:[~2016-08-24 16:43 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-28 19:42 [RFC PATCH 0/10] fortify oom killer even more Michal Hocko
2016-07-28 19:42 ` [PATCH 01/10] mm,oom_reaper: Reduce find_lock_task_mm() usage Michal Hocko
2016-07-28 19:42 ` [PATCH 02/10] mm,oom_reaper: Do not attempt to reap a task twice Michal Hocko
2016-07-28 19:42 ` [PATCH 03/10] oom: keep mm of the killed task available Michal Hocko
2016-07-28 19:42 ` [PATCH 04/10] mm, oom: get rid of signal_struct::oom_victims Michal Hocko
2016-07-28 19:42 ` [PATCH 05/10] kernel, oom: fix potential pgd_lock deadlock from __mmdrop Michal Hocko
2016-07-28 19:42 ` [PATCH 06/10] oom, suspend: fix oom_killer_disable vs. pm suspend properly Michal Hocko
2016-07-28 19:42 ` [PATCH 07/10] mm, oom: enforce exit_oom_victim on current task Michal Hocko
2016-07-28 19:42 ` [PATCH 08/10] exit, oom: postpone exit_oom_victim to later Michal Hocko
2016-07-30  8:20   ` Tetsuo Handa
2016-07-31  9:35     ` Michal Hocko
2016-07-31 10:19       ` Michal Hocko
2016-08-01 10:46       ` Tetsuo Handa
2016-08-01 11:33         ` Michal Hocko
2016-08-02 10:32           ` Tetsuo Handa
2016-08-02 11:31             ` Michal Hocko
2016-07-28 19:42 ` [PATCH 09/10] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Michal Hocko
2016-07-28 20:41   ` Michael S. Tsirkin
2016-07-29  6:04     ` Michal Hocko
2016-07-29 13:14       ` Michael S. Tsirkin
2016-07-29 13:35         ` Michal Hocko
2016-07-29 17:57           ` Michael S. Tsirkin
2016-07-31  9:44             ` Michal Hocko
2016-08-12  9:42               ` Michal Hocko
2016-08-12 13:21                 ` Oleg Nesterov
2016-08-12 14:41                   ` Michal Hocko
2016-08-12 16:05                     ` Oleg Nesterov
2016-08-12 15:57                   ` Paul E. McKenney
2016-08-12 16:09                     ` Oleg Nesterov
2016-08-12 16:26                       ` Paul E. McKenney
2016-08-12 16:23                     ` Michal Hocko
2016-08-13  0:15                   ` Michael S. Tsirkin
2016-08-14  8:41                     ` Michal Hocko
2016-08-14 16:57                       ` Michael S. Tsirkin
2016-08-14 23:06                         ` Michael S. Tsirkin
2016-08-15  9:49                           ` Michal Hocko
2016-08-17 16:58                             ` Michal Hocko
2016-08-22 13:03                   ` Michal Hocko
2016-08-22 21:01                     ` Michael S. Tsirkin
2016-08-23  7:55                       ` Michal Hocko
2016-08-23  9:06                         ` Michal Hocko
2016-08-23 12:54                           ` Michael S. Tsirkin
2016-08-24 16:42                           ` Michal Hocko
2016-08-12  9:43         ` Michal Hocko
2016-07-29 17:07   ` Oleg Nesterov
2016-07-31  9:11     ` Michal Hocko
2016-07-28 19:42 ` [PATCH 10/10] oom, oom_reaper: allow to reap mm shared by the kthreads Michal Hocko
