From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f71.google.com (mail-lf0-f71.google.com [209.85.215.71]) by kanga.kvack.org (Postfix) with ESMTP id A02AE6B0005 for ; Fri, 1 Jul 2016 05:26:44 -0400 (EDT) Received: by mail-lf0-f71.google.com with SMTP id a4so78316739lfa.1 for ; Fri, 01 Jul 2016 02:26:44 -0700 (PDT) Received: from mail-wm0-f50.google.com (mail-wm0-f50.google.com. [74.125.82.50]) by mx.google.com with ESMTPS id ll9si2586236wjb.43.2016.07.01.02.26.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 02:26:43 -0700 (PDT) Received: by mail-wm0-f50.google.com with SMTP id v199so18671500wmv.0 for ; Fri, 01 Jul 2016 02:26:43 -0700 (PDT) From: Michal Hocko Subject: [RFC PATCH 0/6] fortify oom killer even more Date: Fri, 1 Jul 2016 11:26:24 +0200 Message-Id: <1467365190-24640-1-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Tetsuo Handa , Oleg Nesterov , David Rientjes , Vladimir Davydov , "Michael S. Tsirkin" , Michal Hocko Hi, I am sending this pile as an RFC and I hope it will make a good foundation to hopefully plug the remaining holes which could lead to oom lockups for CONFIG_MMU. There are two main parts patches 1-4 and the 5-6. The first pile focuses on moving decisions about oom victims more to mm_struct. Especially the part when there is an oom victim noticed and we decide whether to select new victim. Patch 1 remembers the mm at the time oom victim is selected and it is stable for the rest of the process group life time. This allows some simplifications. The later part is about kthread vs. oom_reaper interaction. It seems that the only use_mm() user which needs fixing is vhost and that is fixable. Then we can remove the kthread restriction and so basically the every oom victim will be reapable now (well except the weird cases where the mm is shared with init but I consider that uninteresting). I haven't tested this properly yet. I will be mostly offline next week but definitely plan to test it later on. Right now I would appreciated feedback/review. If this looks OK then I would like to target it for 4.9. The series is based on top of the current mmotm (2016-06-24-15-53) + http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org Thanks! Michal Hocko (6): oom: keep mm of the killed task available oom, suspend: fix oom_killer_disable vs. pm suspend properly exit, oom: postpone exit_oom_victim to later oom, oom_reaper: consider mmget_not_zero as a failure vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost oom, oom_reaper: allow to reap mm shared by the kthreads drivers/vhost/scsi.c | 2 +- drivers/vhost/vhost.c | 18 +++---- include/linux/oom.h | 2 +- include/linux/sched.h | 3 ++ include/linux/uaccess.h | 22 +++++++++ include/linux/uio.h | 10 ++++ kernel/exit.c | 5 +- kernel/fork.c | 2 + kernel/power/process.c | 17 ++----- mm/mmu_context.c | 6 +++ mm/oom_kill.c | 127 ++++++++++++++++++++++++------------------------ 11 files changed, 124 insertions(+), 90 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id D790B6B0253 for ; Fri, 1 Jul 2016 05:26:53 -0400 (EDT) Received: by mail-wm0-f71.google.com with SMTP id a66so13405060wme.1 for ; Fri, 01 Jul 2016 02:26:53 -0700 (PDT) Received: from mail-wm0-f67.google.com (mail-wm0-f67.google.com. [74.125.82.67]) by mx.google.com with ESMTPS id t5si3188581wmt.124.2016.07.01.02.26.52 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 02:26:52 -0700 (PDT) Received: by mail-wm0-f67.google.com with SMTP id 187so3863681wmz.1 for ; Fri, 01 Jul 2016 02:26:52 -0700 (PDT) From: Michal Hocko Subject: [RFC PATCH 1/6] oom: keep mm of the killed task available Date: Fri, 1 Jul 2016 11:26:25 +0200 Message-Id: <1467365190-24640-2-git-send-email-mhocko@kernel.org> In-Reply-To: <1467365190-24640-1-git-send-email-mhocko@kernel.org> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Tetsuo Handa , Oleg Nesterov , David Rientjes , Vladimir Davydov , Michal Hocko From: Michal Hocko oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will never block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see 74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable race")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since "mm, oom: hide mm which is shared with kthread or global init". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. exit_oom_victim from __oom_reap_task can be dropped. MMF_OOM_NOT_REAPABLE is trivial to implement as well because we just need to OOM_SCAN_SELECT it when we see the flag. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Signed-off-by: Michal Hocko --- include/linux/sched.h | 2 ++ kernel/fork.c | 2 ++ mm/oom_kill.c | 67 +++++++++++++++++++++------------------------------ 3 files changed, 31 insertions(+), 40 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6d81a1eb974a..befdcc1cde3c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -793,6 +793,8 @@ struct signal_struct { short oom_score_adj; /* OOM kill score adjustment */ short oom_score_adj_min; /* OOM kill score adjustment min value. * Only settable by CAP_SYS_RESOURCE. */ + struct mm_struct *oom_mm; /* recorded mm when the thread group got + * killed by the oom killer */ struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations diff --git a/kernel/fork.c b/kernel/fork.c index 452fc864f2f6..2bd3cc73d103 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -245,6 +245,8 @@ static inline void free_signal_struct(struct signal_struct *sig) { taskstats_tgid_free(sig); sched_autogroup_exit(sig); + if (sig->oom_mm) + mmdrop(sig->oom_mm); kmem_cache_free(signal_cachep, sig); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7d0a275df822..4ea4a649822d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, * Don't allow any other task to have access to the reserves unless * the task has MMF_OOM_REAPED because chances that it would release * any memory is quite low. + * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time + * so let it try again. */ if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { - struct task_struct *p = find_lock_task_mm(task); + struct mm_struct *mm = task->signal->oom_mm; enum oom_scan_t ret = OOM_SCAN_ABORT; - if (p) { - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) - ret = OOM_SCAN_CONTINUE; - task_unlock(p); - } + if (test_bit(MMF_OOM_REAPED, &mm->flags)) + ret = OOM_SCAN_CONTINUE; + else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) + ret = OOM_SCAN_SELECT; return ret; } @@ -457,7 +458,6 @@ static bool __oom_reap_task(struct task_struct *tsk) struct mmu_gather tlb; struct vm_area_struct *vma; struct mm_struct *mm = NULL; - struct task_struct *p; struct zap_details details = {.check_swap_entries = true, .ignore_dirty = true}; bool ret = true; @@ -478,22 +478,10 @@ static bool __oom_reap_task(struct task_struct *tsk) */ mutex_lock(&oom_lock); - /* - * Make sure we find the associated mm_struct even when the particular - * thread has already terminated and cleared its mm. - * We might have race with exit path so consider our work done if there - * is no mm. - */ - p = find_lock_task_mm(tsk); - if (!p) - goto unlock_oom; - mm = p->mm; - atomic_inc(&mm->mm_count); - task_unlock(p); - + mm = tsk->signal->oom_mm; if (!down_read_trylock(&mm->mmap_sem)) { ret = false; - goto mm_drop; + goto unlock_oom; } /* @@ -503,7 +491,7 @@ static bool __oom_reap_task(struct task_struct *tsk) */ if (!mmget_not_zero(mm)) { up_read(&mm->mmap_sem); - goto mm_drop; + goto unlock_oom; } tlb_gather_mmu(&tlb, mm, 0, -1); @@ -551,8 +539,6 @@ static bool __oom_reap_task(struct task_struct *tsk) * put the oom_reaper out of the way. */ mmput_async(mm); -mm_drop: - mmdrop(mm); unlock_oom: mutex_unlock(&oom_lock); return ret; @@ -568,7 +554,7 @@ static void oom_reap_task(struct task_struct *tsk) schedule_timeout_idle(HZ/10); if (attempts > MAX_OOM_REAP_RETRIES) { - struct task_struct *p; + struct mm_struct *mm; pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); @@ -579,27 +565,17 @@ static void oom_reap_task(struct task_struct *tsk) * so hide the mm from the oom killer so that it can move on * to another task with a different mm struct. */ - p = find_lock_task_mm(tsk); - if (p) { - if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) { - pr_info("oom_reaper: giving up pid:%d (%s)\n", - task_pid_nr(tsk), tsk->comm); - set_bit(MMF_OOM_REAPED, &p->mm->flags); - } - task_unlock(p); + mm = tsk->signal->oom_mm; + if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) { + pr_info("oom_reaper: giving up pid:%d (%s)\n", + task_pid_nr(tsk), tsk->comm); + set_bit(MMF_OOM_REAPED, &mm->flags); } debug_show_all_locks(); } - /* - * Clear TIF_MEMDIE because the task shouldn't be sitting on a - * reasonably reclaimable memory anymore or it is not a good candidate - * for the oom victim right now because it cannot release its memory - * itself nor by the oom reaper. - */ tsk->oom_reaper_list = NULL; - exit_oom_victim(tsk); /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); @@ -664,14 +640,25 @@ subsys_initcall(oom_init) * * Has to be called with oom_lock held and never after * oom has been disabled already. + * + * tsk->mm has to be non NULL and caller has to guarantee it is stable (either + * under task_lock or operate on the current). */ void mark_oom_victim(struct task_struct *tsk) { + struct mm_struct *mm = tsk->mm; + WARN_ON(oom_killer_disabled); /* OOM killer might race with memcg OOM */ if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) return; + atomic_inc(&tsk->signal->oom_victims); + + /* oom_mm is bound to the signal struct life time. */ + if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) + atomic_inc(&tsk->signal->oom_mm->mm_count); + /* * Make sure that the task is woken up from uninterruptible sleep * if it is frozen because OOM killer wouldn't be able to free -- 2.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id A73216B025E for ; Fri, 1 Jul 2016 05:26:54 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id f126so13368337wma.3 for ; Fri, 01 Jul 2016 02:26:54 -0700 (PDT) Received: from mail-wm0-f66.google.com (mail-wm0-f66.google.com. [74.125.82.66]) by mx.google.com with ESMTPS id yu6si2733516wjb.118.2016.07.01.02.26.53 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 02:26:53 -0700 (PDT) Received: by mail-wm0-f66.google.com with SMTP id 187so3863749wmz.1 for ; Fri, 01 Jul 2016 02:26:53 -0700 (PDT) From: Michal Hocko Subject: [RFC PATCH 2/6] oom, suspend: fix oom_killer_disable vs. pm suspend properly Date: Fri, 1 Jul 2016 11:26:26 +0200 Message-Id: <1467365190-24640-3-git-send-email-mhocko@kernel.org> In-Reply-To: <1467365190-24640-1-git-send-email-mhocko@kernel.org> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Tetsuo Handa , Oleg Nesterov , David Rientjes , Vladimir Davydov , Michal Hocko From: Michal Hocko 74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable race") has workaround an existing race between oom_killer_disable and oom_reaper by adding another round of try_to_freeze_tasks after the oom killer was disabled. This was an easiest thing to do for a late 4.7 fix. Let's fix it properly now. After "oom: keep mm of the killed task available" we no longer call exit_oom_victim from the oom reaper so the race described in the above commit doesn't exist anymore. Unfortunately this alone is not sufficient for the oom_killer_disable usecase because now we do not have any reliable way to reach exit_oom_victim (the victim might get stuck on a way to exit for an unbounded amount of time). OOM killer can cope with that by checking mm flags and move on to another victim but we cannot do the same for oom_killer_disable as we would lose the guarantee of no further interference of the victim with the rest of the system. What we can do instead is to cap the maximum time the oom_killer_disable waits for victims. The only current user of this function (pm suspend) already has a concept of timeout for back off so we can reuse the same value there. Let's drop set_freezable for the oom_reaper kthread because it is no longer needed as the reaper doesn't wake or thaw any processes. Signed-off-by: Michal Hocko --- include/linux/oom.h | 2 +- kernel/power/process.c | 17 +++-------------- mm/oom_kill.c | 33 ++++++++++++++++++++------------- 3 files changed, 24 insertions(+), 28 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 5bc0457ee3a8..eb44374a3f32 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -102,7 +102,7 @@ extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); extern bool oom_killer_disabled; -extern bool oom_killer_disable(void); +extern bool oom_killer_disable(signed long timeout); extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); diff --git a/kernel/power/process.c b/kernel/power/process.c index 0c2ee9761d57..2456f10c7326 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -141,23 +141,12 @@ int freeze_processes(void) /* * Now that the whole userspace is frozen we need to disbale * the OOM killer to disallow any further interference with - * killable tasks. + * killable tasks. There is no guarantee oom victims will + * ever reach a point they go away we have to wait with a timeout. */ - if (!error && !oom_killer_disable()) + if (!error && !oom_killer_disable(msecs_to_jiffies(freeze_timeout_msecs))) error = -EBUSY; - /* - * There is a hard to fix race between oom_reaper kernel thread - * and oom_killer_disable. oom_reaper calls exit_oom_victim - * before the victim reaches exit_mm so try to freeze all the tasks - * again and catch such a left over task. - */ - if (!error) { - pr_info("Double checking all user space processes after OOM killer disable... "); - error = try_to_freeze_tasks(true); - pr_cont("\n"); - } - if (error) thaw_processes(); return error; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4ea4a649822d..4ac089cba353 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -583,8 +583,6 @@ static void oom_reap_task(struct task_struct *tsk) static int oom_reaper(void *unused) { - set_freezable(); - while (true) { struct task_struct *tsk = NULL; @@ -683,10 +681,20 @@ void exit_oom_victim(struct task_struct *tsk) } /** + * oom_killer_enable - enable OOM killer + */ +void oom_killer_enable(void) +{ + oom_killer_disabled = false; +} + +/** * oom_killer_disable - disable OOM killer + * @timeout: maximum timeout to wait for oom victims in jiffies * * Forces all page allocations to fail rather than trigger OOM killer. - * Will block and wait until all OOM victims are killed. + * Will block and wait until all OOM victims are killed or the given + * timeout expires. * * The function cannot be called when there are runnable user tasks because * the userspace would see unexpected allocation failures as a result. Any @@ -695,8 +703,10 @@ void exit_oom_victim(struct task_struct *tsk) * Returns true if successful and false if the OOM killer cannot be * disabled. */ -bool oom_killer_disable(void) +bool oom_killer_disable(signed long timeout) { + signed long ret; + /* * Make sure to not race with an ongoing OOM killer. Check that the * current is not killed (possibly due to sharing the victim's memory). @@ -706,19 +716,16 @@ bool oom_killer_disable(void) oom_killer_disabled = true; mutex_unlock(&oom_lock); - wait_event(oom_victims_wait, !atomic_read(&oom_victims)); + ret = wait_event_interruptible_timeout(oom_victims_wait, + !atomic_read(&oom_victims), timeout); + if (ret <= 0) { + oom_killer_enable(); + return false; + } return true; } -/** - * oom_killer_enable - enable OOM killer - */ -void oom_killer_enable(void) -{ - oom_killer_disabled = false; -} - static inline bool __task_will_free_mem(struct task_struct *task) { struct signal_struct *sig = task->signal; -- 2.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id C33F46B025F for ; Fri, 1 Jul 2016 05:26:56 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id f126so13369061wma.3 for ; Fri, 01 Jul 2016 02:26:56 -0700 (PDT) Received: from mail-wm0-f66.google.com (mail-wm0-f66.google.com. [74.125.82.66]) by mx.google.com with ESMTPS id lp5si2763579wjb.121.2016.07.01.02.26.53 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 02:26:54 -0700 (PDT) Received: by mail-wm0-f66.google.com with SMTP id r201so3870743wme.0 for ; Fri, 01 Jul 2016 02:26:53 -0700 (PDT) From: Michal Hocko Subject: [RFC PATCH 3/6] exit, oom: postpone exit_oom_victim to later Date: Fri, 1 Jul 2016 11:26:27 +0200 Message-Id: <1467365190-24640-4-git-send-email-mhocko@kernel.org> In-Reply-To: <1467365190-24640-1-git-send-email-mhocko@kernel.org> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Tetsuo Handa , Oleg Nesterov , David Rientjes , Vladimir Davydov , Michal Hocko From: Michal Hocko exit_oom_victim was called after mmput because it is expected that address space of the victim would get released by that time and there is no reason to hold off the oom killer from selecting another task should that be insufficient to handle the oom situation. In order to catch post exit_mm() allocations we used to check for PF_EXITING but this got removed by 6a618957ad17 ("mm: oom_kill: don't ignore oom score on exiting tasks") because this check was lockup prone. It seems that we have all needed pieces ready now and can finally fix this properly (at least for CONFIG_MMU cases where we have the oom_reaper). Since "oom: keep mm of the killed task available" we have a reliable way to ignore oom victims which are no longer interesting because they either were reaped and do not sit on a lot of memory or they are not reapable for some reason and it is safer to ignore them and move on to another victim. That means that we can safely postpone exit_oom_victim to closer to the final schedule. There is possible advantages of this because we are reducing chances of further interference of the oom victim with the rest of the system after oom_killer_disable(). Strictly speaking this is possible right now because there are indeed allocations possible past exit_mm() and who knows whether some of them can trigger IO. I haven't seen this in practice though. Signed-off-by: Michal Hocko --- kernel/exit.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index 9e6e1356e6bb..a7260c05f18c 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -434,8 +434,6 @@ static void exit_mm(struct task_struct *tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); - if (test_thread_flag(TIF_MEMDIE)) - exit_oom_victim(tsk); } static struct task_struct *find_alive_thread(struct task_struct *p) @@ -822,6 +820,9 @@ void do_exit(long code) smp_mb(); raw_spin_unlock_wait(&tsk->pi_lock); + if (test_thread_flag(TIF_MEMDIE)) + exit_oom_victim(tsk); + /* causes final put_task_struct in finish_task_switch(). */ tsk->state = TASK_DEAD; tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ -- 2.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f70.google.com (mail-lf0-f70.google.com [209.85.215.70]) by kanga.kvack.org (Postfix) with ESMTP id 1FF696B0260 for ; Fri, 1 Jul 2016 05:26:59 -0400 (EDT) Received: by mail-lf0-f70.google.com with SMTP id a4so78320424lfa.1 for ; Fri, 01 Jul 2016 02:26:59 -0700 (PDT) Received: from mail-wm0-f65.google.com (mail-wm0-f65.google.com. [74.125.82.65]) by mx.google.com with ESMTPS id pa7si2805749wjb.109.2016.07.01.02.26.54 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 02:26:54 -0700 (PDT) Received: by mail-wm0-f65.google.com with SMTP id 187so3863900wmz.1 for ; Fri, 01 Jul 2016 02:26:54 -0700 (PDT) From: Michal Hocko Subject: [RFC PATCH 4/6] oom, oom_reaper: consider mmget_not_zero as a failure Date: Fri, 1 Jul 2016 11:26:28 +0200 Message-Id: <1467365190-24640-5-git-send-email-mhocko@kernel.org> In-Reply-To: <1467365190-24640-1-git-send-email-mhocko@kernel.org> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Tetsuo Handa , Oleg Nesterov , David Rientjes , Vladimir Davydov , Michal Hocko From: Michal Hocko mmget_not_zero failing means that we are racing with mmput->__mmput and we are currently interpreting this as a success because we believe that __mmput will release the address space. This is not guaranteed though because at least exit_aio might wait on IO and it is not entirely clear whether it will terminate in a bounded amount of time. It is hard to tell what else is lurking there. This patch makes this path more conservative and we report a failure which will lead to setting MMF_OOM_NOT_REAPABLE and MMF_OOM_REAPED if this state is permanent. Signed-off-by: Michal Hocko --- mm/oom_kill.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4ac089cba353..b2210b6c38ba 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -460,7 +460,7 @@ static bool __oom_reap_task(struct task_struct *tsk) struct mm_struct *mm = NULL; struct zap_details details = {.check_swap_entries = true, .ignore_dirty = true}; - bool ret = true; + bool ret = false; /* * We have to make sure to not race with the victim exit path @@ -479,10 +479,8 @@ static bool __oom_reap_task(struct task_struct *tsk) mutex_lock(&oom_lock); mm = tsk->signal->oom_mm; - if (!down_read_trylock(&mm->mmap_sem)) { - ret = false; + if (!down_read_trylock(&mm->mmap_sem)) goto unlock_oom; - } /* * increase mm_users only after we know we will reap something so @@ -494,6 +492,7 @@ static bool __oom_reap_task(struct task_struct *tsk) goto unlock_oom; } + ret = true; tlb_gather_mmu(&tlb, mm, 0, -1); for (vma = mm->mmap ; vma; vma = vma->vm_next) { if (is_vm_hugetlb_page(vma)) -- 2.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f69.google.com (mail-lf0-f69.google.com [209.85.215.69]) by kanga.kvack.org (Postfix) with ESMTP id 9B51A6B0261 for ; Fri, 1 Jul 2016 05:27:01 -0400 (EDT) Received: by mail-lf0-f69.google.com with SMTP id a2so77748640lfe.0 for ; Fri, 01 Jul 2016 02:27:01 -0700 (PDT) Received: from mail-wm0-f65.google.com (mail-wm0-f65.google.com. [74.125.82.65]) by mx.google.com with ESMTPS id ag4si2773651wjc.200.2016.07.01.02.26.55 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 02:26:55 -0700 (PDT) Received: by mail-wm0-f65.google.com with SMTP id c82so3855172wme.3 for ; Fri, 01 Jul 2016 02:26:55 -0700 (PDT) From: Michal Hocko Subject: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Date: Fri, 1 Jul 2016 11:26:29 +0200 Message-Id: <1467365190-24640-6-git-send-email-mhocko@kernel.org> In-Reply-To: <1467365190-24640-1-git-send-email-mhocko@kernel.org> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Tetsuo Handa , Oleg Nesterov , David Rientjes , Vladimir Davydov , Michal Hocko , "Michael S. Tsirkin" From: Michal Hocko vhost driver relies on copy_from_user/get_user from a kernel thread. This makes it impossible to reap the memory of an oom victim which shares mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. Make sure that each place which can read from userspace is annotated properly and it uses copy_from_user_mm, __get_user_mm resp. copy_from_iter_mm. Each will get the target mm as an argument and it performs a pessimistic check to rule out that the oom_reaper could possibly unmap the particular page. __oom_reap_task then just needs to mark the mm as unstable before it unmaps any page. This is a preparatory patch without any functional changes because the oom reaper doesn't touch mm shared with kthreads yet. Cc: "Michael S. Tsirkin" Signed-off-by: Michal Hocko --- drivers/vhost/scsi.c | 2 +- drivers/vhost/vhost.c | 18 +++++++++--------- include/linux/sched.h | 1 + include/linux/uaccess.h | 22 ++++++++++++++++++++++ include/linux/uio.h | 10 ++++++++++ mm/oom_kill.c | 8 ++++++++ 6 files changed, 51 insertions(+), 10 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 0e6fd556c982..2c8dc0b9a21f 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -932,7 +932,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq) */ iov_iter_init(&out_iter, WRITE, vq->iov, out, out_size); - ret = copy_from_iter(req, req_size, &out_iter); + ret = copy_from_iter_mm(vq->dev->mm, req, req_size, &out_iter); if (unlikely(ret != req_size)) { vq_err(vq, "Faulted on copy_from_iter\n"); vhost_scsi_send_bad_target(vs, vq, head, out); diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 669fef1e2bb6..71a754a0fe7e 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1212,7 +1212,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq) r = -EFAULT; goto err; } - r = __get_user(last_used_idx, &vq->used->idx); + r = __get_user_mm(vq->dev->mm, last_used_idx, &vq->used->idx); if (r) goto err; vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx); @@ -1328,7 +1328,7 @@ static int get_indirect(struct vhost_virtqueue *vq, i, count); return -EINVAL; } - if (unlikely(copy_from_iter(&desc, sizeof(desc), &from) != + if (unlikely(copy_from_iter_mm(vq->dev->mm, &desc, sizeof(desc), &from) != sizeof(desc))) { vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n", i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc); @@ -1392,7 +1392,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Check it isn't doing very strange things with descriptor numbers. */ last_avail_idx = vq->last_avail_idx; - if (unlikely(__get_user(avail_idx, &vq->avail->idx))) { + if (unlikely(__get_user_mm(vq->dev->mm, avail_idx, &vq->avail->idx))) { vq_err(vq, "Failed to access avail idx at %p\n", &vq->avail->idx); return -EFAULT; @@ -1414,7 +1414,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Grab the next descriptor number they're advertising, and increment * the index we've seen. */ - if (unlikely(__get_user(ring_head, + if (unlikely(__get_user_mm(vq->dev->mm, ring_head, &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) { vq_err(vq, "Failed to read head: idx %d address %p\n", last_avail_idx, @@ -1450,7 +1450,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, i, vq->num, head); return -EINVAL; } - ret = __copy_from_user(&desc, vq->desc + i, sizeof desc); + ret = __copy_from_user_mm(vq->dev->mm, &desc, vq->desc + i, sizeof desc); if (unlikely(ret)) { vq_err(vq, "Failed to get descriptor: idx %d addr %p\n", i, vq->desc + i); @@ -1622,7 +1622,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) { __virtio16 flags; - if (__get_user(flags, &vq->avail->flags)) { + if (__get_user_mm(dev->mm, flags, &vq->avail->flags)) { vq_err(vq, "Failed to get flags"); return true; } @@ -1636,7 +1636,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) if (unlikely(!v)) return true; - if (__get_user(event, vhost_used_event(vq))) { + if (__get_user_mm(dev->mm, event, vhost_used_event(vq))) { vq_err(vq, "Failed to get used event idx"); return true; } @@ -1678,7 +1678,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) __virtio16 avail_idx; int r; - r = __get_user(avail_idx, &vq->avail->idx); + r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx); if (r) return false; @@ -1713,7 +1713,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) /* They could have slipped one in as we were doing that: make * sure it's written, then check again. */ smp_mb(); - r = __get_user(avail_idx, &vq->avail->idx); + r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx); if (r) { vq_err(vq, "Failed to check avail idx at %p: %d\n", &vq->avail->idx, r); diff --git a/include/linux/sched.h b/include/linux/sched.h index befdcc1cde3c..ff5102adb0c4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ #define MMF_OOM_REAPED 21 /* mm has been already reaped */ #define MMF_OOM_NOT_REAPABLE 22 /* mm couldn't be reaped */ +#define MMF_UNSTABLE 23 /* mm is unstable for copy_from_user */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 349557825428..a327d5362581 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to, #endif /* ARCH_HAS_NOCACHE_UACCESS */ /* + * A safe variant of __get_user for for use_mm() users to have a + * gurantee that the address space wasn't reaped in the background + */ +#define __get_user_mm(mm, x, ptr) \ +({ \ + int ___gu_err = __get_user(x, ptr); \ + if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags)) \ + ___gu_err = -EFAULT; \ + ___gu_err; \ +}) + +/* similar to __get_user_mm */ +static inline __must_check long __copy_from_user_mm(struct mm_struct *mm, + void *to, const void __user * from, unsigned long n) +{ + long ret = __copy_from_user(to, from, n); + if ((ret >= 0) && test_bit(MMF_UNSTABLE, &mm->flags)) + return -EFAULT; + return ret; +} + +/* * probe_kernel_read(): safely attempt to read from a location * @dst: pointer to the buffer that shall take the data * @src: address to read from diff --git a/include/linux/uio.h b/include/linux/uio.h index 1b5d1cd796e2..4be6b24003d8 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -9,6 +9,7 @@ #ifndef __LINUX_UIO_H #define __LINUX_UIO_H +#include #include #include @@ -84,6 +85,15 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i); size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i); + +static inline size_t copy_from_iter_mm(struct mm_struct *mm, void *addr, + size_t bytes, struct iov_iter *i) +{ + size_t ret = copy_from_iter(addr, bytes, i); + if (!IS_ERR_VALUE(ret) && test_bit(MMF_UNSTABLE, &mm->flags)) + return -EFAULT; + return ret; +} size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i); size_t iov_iter_zero(size_t bytes, struct iov_iter *); unsigned long iov_iter_alignment(const struct iov_iter *i); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index b2210b6c38ba..38a0cd32c01b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -492,6 +492,14 @@ static bool __oom_reap_task(struct task_struct *tsk) goto unlock_oom; } + /* + * Tell all users of get_user_mm/copy_from_user_mm that the content + * is no longer stable. No barriers really needed because unmapping + * should imply barriers already and the reader would hit a page fault + * if it stumbled over a reaped memory. + */ + set_bit(MMF_UNSTABLE, &mm->flags); + ret = true; tlb_gather_mmu(&tlb, mm, 0, -1); for (vma = mm->mmap ; vma; vma = vma->vm_next) { -- 2.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f70.google.com (mail-lf0-f70.google.com [209.85.215.70]) by kanga.kvack.org (Postfix) with ESMTP id C6C4C6B0262 for ; Fri, 1 Jul 2016 05:27:03 -0400 (EDT) Received: by mail-lf0-f70.google.com with SMTP id a4so78322555lfa.1 for ; Fri, 01 Jul 2016 02:27:03 -0700 (PDT) Received: from mail-wm0-f66.google.com (mail-wm0-f66.google.com. [74.125.82.66]) by mx.google.com with ESMTPS id ub2si2645915wjc.93.2016.07.01.02.26.56 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 01 Jul 2016 02:26:56 -0700 (PDT) Received: by mail-wm0-f66.google.com with SMTP id c82so3855240wme.3 for ; Fri, 01 Jul 2016 02:26:56 -0700 (PDT) From: Michal Hocko Subject: [RFC PATCH 6/6] oom, oom_reaper: allow to reap mm shared by the kthreads Date: Fri, 1 Jul 2016 11:26:30 +0200 Message-Id: <1467365190-24640-7-git-send-email-mhocko@kernel.org> In-Reply-To: <1467365190-24640-1-git-send-email-mhocko@kernel.org> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Tetsuo Handa , Oleg Nesterov , David Rientjes , Vladimir Davydov , Michal Hocko From: Michal Hocko oom reaper was skipped for an mm which is shared with the kernel thread (aka use_mm()). The primary concern was that such a kthread might want to read from the userspace memory and see zero page as a result of the oom reaper action. This seems to be overly conservative because none of the current use_mm() users need to do copy_from_user or get_user. aio code used to rely on copy_from_user but this is long gone along with use_mm() usage in fs/aio.c. We currently have only 3 users in the kernel: - ffs_user_copy_worker, ep_user_copy_worker only do copy_to_iter() - vhost_worker needs to copy from userspace but it relies on the safe __get_user_mm, copy_from_user_mm resp. copy_from_iter_mm Add a note to use_mm about the copy_from_user risk and allow the oom killer to invoke the oom_reaper for mms shared with kthreads. This will practically cause all the sane use cases to be reapable. Signed-off-by: Michal Hocko --- mm/mmu_context.c | 6 ++++++ mm/oom_kill.c | 14 +++++++------- 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/mm/mmu_context.c b/mm/mmu_context.c index f802c2d216a7..61a7a90250be 100644 --- a/mm/mmu_context.c +++ b/mm/mmu_context.c @@ -16,6 +16,12 @@ * mm context. * (Note: this routine is intended to be called only * from a kernel thread context) + * + * Do not use copy_from_user/__get_user from this context + * and use the safe copy_from_user_mm/__get_user_mm because + * the address space might got reclaimed behind the back by + * the oom_reaper so an unexpected zero page might be + * encountered. */ void use_mm(struct mm_struct *mm) { diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 38a0cd32c01b..d5ea082a7360 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -914,13 +914,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, continue; if (same_thread_group(p, victim)) continue; - if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) { - /* - * We cannot use oom_reaper for the mm shared by this - * process because it wouldn't get killed and so the - * memory might be still used. Hide the mm from the oom - * killer to guarantee OOM forward progress. - */ + if (is_global_init(p)) { can_oom_reap = false; set_bit(MMF_OOM_REAPED, &mm->flags); pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n", @@ -928,6 +922,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, task_pid_nr(p), p->comm); continue; } + /* + * No use_mm() user needs to read from the userspace so we are + * ok to reap it. + */ + if (unlikely(p->flags & PF_KTHREAD)) + continue; do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); } rcu_read_unlock(); -- 2.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 5CD716B0005 for ; Sat, 2 Jul 2016 22:45:45 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id a69so320797496pfa.1 for ; Sat, 02 Jul 2016 19:45:45 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id x78si1391828pfa.126.2016.07.02.19.45.44 for (version=TLS1 cipher=AES128-SHA bits=128/128); Sat, 02 Jul 2016 19:45:44 -0700 (PDT) Subject: Re: [RFC PATCH 1/6] oom: keep mm of the killed task available From: Tetsuo Handa References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-2-git-send-email-mhocko@kernel.org> In-Reply-To: <1467365190-24640-2-git-send-email-mhocko@kernel.org> Message-Id: <201607031145.HIF90125.LMHQVFJOtOSOFF@I-love.SAKURA.ne.jp> Date: Sun, 3 Jul 2016 11:45:34 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@kernel.org, linux-mm@kvack.org Cc: akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com, mhocko@suse.com Michal Hocko wrote: > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 7d0a275df822..4ea4a649822d 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > * Don't allow any other task to have access to the reserves unless > * the task has MMF_OOM_REAPED because chances that it would release > * any memory is quite low. > + * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time > + * so let it try again. > */ > if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { > - struct task_struct *p = find_lock_task_mm(task); > + struct mm_struct *mm = task->signal->oom_mm; > enum oom_scan_t ret = OOM_SCAN_ABORT; > > - if (p) { > - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) > - ret = OOM_SCAN_CONTINUE; > - task_unlock(p); > - } > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > + ret = OOM_SCAN_CONTINUE; > + else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) > + ret = OOM_SCAN_SELECT; I don't think this is useful. MMF_OOM_NOT_REAPABLE is set when mm->mmap_sem could not be held for read by the OOM reaper thread. That occurs when someone is blocked at unkillable wait with that mm->mmap_sem held for write. Unless the reason that someone is blocked is lack of CPU time, the reason is likely that that someone is blocked due to waiting for somebody else's memory allocation. Then, it won't succeed that retrying OOM reaping MMF_OOM_NOT_REAPABLE mm as soon as oom_scan_process_thread() finds it. At least, retrying OOM reaping MMF_OOM_NOT_REAPABLE mm should be attempted after that someone is no longer blocked due to waiting for somebody else's memory allocation (e.g. retry only when oom_scan_process_thread() is sure that the OOM reaper thread can hold mm->mmap_sem for read). But I don't think with need to dance with task->signal->oom_mm. See my series which removes task->signal->oom_victims and OOM_SCAN_ABORT case. > > return ret; > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw0-f198.google.com (mail-yw0-f198.google.com [209.85.161.198]) by kanga.kvack.org (Postfix) with ESMTP id EDEB46B0005 for ; Sun, 3 Jul 2016 09:47:26 -0400 (EDT) Received: by mail-yw0-f198.google.com with SMTP id i12so61733506ywa.0 for ; Sun, 03 Jul 2016 06:47:26 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id t40si1791431qtc.2.2016.07.03.06.47.26 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Jul 2016 06:47:26 -0700 (PDT) Date: Sun, 3 Jul 2016 15:47:19 +0200 From: Oleg Nesterov Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160703134719.GA28492@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1467365190-24640-6-git-send-email-mhocko@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov , Michal Hocko , "Michael S. Tsirkin" On 07/01, Michal Hocko wrote: > > From: Michal Hocko > > vhost driver relies on copy_from_user/get_user from a kernel thread. > This makes it impossible to reap the memory of an oom victim which > shares mm with the vhost kernel thread because it could see a zero > page unexpectedly and theoretically make an incorrect decision visible > outside of the killed task context. And I still can't understand how, but let me repeat that I don't understand this code at all. > To quote Michael S. Tsirkin: > : Getting an error from __get_user and friends is handled gracefully. > : Getting zero instead of a real value will cause userspace > : memory corruption. Which userspace memory corruption? We are going to kill the dev->mm owner, the task which did ioctl(VHOST_SET_OWNER) and (at first glance) the task who communicates with the callbacks fired by vhost_worker(). Michael, could you please spell why should we care? > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -492,6 +492,14 @@ static bool __oom_reap_task(struct task_struct *tsk) > goto unlock_oom; > } > > + /* > + * Tell all users of get_user_mm/copy_from_user_mm that the content > + * is no longer stable. No barriers really needed because unmapping > + * should imply barriers already and the reader would hit a page fault > + * if it stumbled over a reaped memory. > + */ > + set_bit(MMF_UNSTABLE, &mm->flags); And this is racy anyway. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f199.google.com (mail-qt0-f199.google.com [209.85.216.199]) by kanga.kvack.org (Postfix) with ESMTP id 7F6596B0005 for ; Sun, 3 Jul 2016 10:09:09 -0400 (EDT) Received: by mail-qt0-f199.google.com with SMTP id v18so363121953qtv.0 for ; Sun, 03 Jul 2016 07:09:09 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id s186si1817026qkh.49.2016.07.03.07.09.08 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Jul 2016 07:09:08 -0700 (PDT) Date: Sun, 3 Jul 2016 17:09:04 +0300 From: "Michael S. Tsirkin" Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160703140904.GA26908@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703134719.GA28492@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Oleg Nesterov Cc: Michal Hocko , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov , Michal Hocko On Sun, Jul 03, 2016 at 03:47:19PM +0200, Oleg Nesterov wrote: > On 07/01, Michal Hocko wrote: > > > > From: Michal Hocko > > > > vhost driver relies on copy_from_user/get_user from a kernel thread. > > This makes it impossible to reap the memory of an oom victim which > > shares mm with the vhost kernel thread because it could see a zero > > page unexpectedly and theoretically make an incorrect decision visible > > outside of the killed task context. > > And I still can't understand how, but let me repeat that I don't understand > this code at all. > > > To quote Michael S. Tsirkin: > > : Getting an error from __get_user and friends is handled gracefully. > > : Getting zero instead of a real value will cause userspace > > : memory corruption. > > Which userspace memory corruption? We are going to kill the dev->mm owner, > the task which did ioctl(VHOST_SET_OWNER) and (at first glance) the task > who communicates with the callbacks fired by vhost_worker(). > > Michael, could you please spell why should we care? I am concerned that - oom victim is sharing memory with another task - getting incorrect value from ring read makes vhost change that shared memory Also, I don't see where do we kill the task that communicates with the callbacks. Having said all that, how about we just add some kind of per-mm notifier list, and let vhost know that owner is going away so it should stop looking at memory? Seems cleaner than looking at flags at each memory access, since vhost has its own locking. > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -492,6 +492,14 @@ static bool __oom_reap_task(struct task_struct *tsk) > > goto unlock_oom; > > } > > > > + /* > > + * Tell all users of get_user_mm/copy_from_user_mm that the content > > + * is no longer stable. No barriers really needed because unmapping > > + * should imply barriers already and the reader would hit a page fault > > + * if it stumbled over a reaped memory. > > + */ > > + set_bit(MMF_UNSTABLE, &mm->flags); > > And this is racy anyway. > > Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f71.google.com (mail-vk0-f71.google.com [209.85.213.71]) by kanga.kvack.org (Postfix) with ESMTP id 27D8B6B0005 for ; Sun, 3 Jul 2016 11:18:37 -0400 (EDT) Received: by mail-vk0-f71.google.com with SMTP id m127so28328068vkb.3 for ; Sun, 03 Jul 2016 08:18:37 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id 21si1688577qkk.272.2016.07.03.08.18.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Jul 2016 08:18:36 -0700 (PDT) Date: Sun, 3 Jul 2016 17:18:29 +0200 From: Oleg Nesterov Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160703151829.GA28667@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703140904.GA26908@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael S. Tsirkin" Cc: Michal Hocko , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov , Michal Hocko On 07/03, Michael S. Tsirkin wrote: > > On Sun, Jul 03, 2016 at 03:47:19PM +0200, Oleg Nesterov wrote: > > On 07/01, Michal Hocko wrote: > > > > > > From: Michal Hocko > > > > > > vhost driver relies on copy_from_user/get_user from a kernel thread. > > > This makes it impossible to reap the memory of an oom victim which > > > shares mm with the vhost kernel thread because it could see a zero > > > page unexpectedly and theoretically make an incorrect decision visible > > > outside of the killed task context. > > > > And I still can't understand how, but let me repeat that I don't understand > > this code at all. > > > > > To quote Michael S. Tsirkin: > > > : Getting an error from __get_user and friends is handled gracefully. > > > : Getting zero instead of a real value will cause userspace > > > : memory corruption. > > > > Which userspace memory corruption? We are going to kill the dev->mm owner, > > the task which did ioctl(VHOST_SET_OWNER) and (at first glance) the task > > who communicates with the callbacks fired by vhost_worker(). > > > > Michael, could you please spell why should we care? > > I am concerned that > - oom victim is sharing memory with another task > - getting incorrect value from ring read makes vhost > change that shared memory Well, we are going to kill all tasks which share this memory. I mean, ->mm. If "sharing memory with another task" means, say, a file, then this memory won't be unmapped (if shared). So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted? Sorry, I simply do not know what vhost does, quite possibly a stupid question. > Having said all that, how about we just add some kind of per-mm > notifier list, and let vhost know that owner is going away so > it should stop looking at memory? > > Seems cleaner than looking at flags at each memory access, > since vhost has its own locking. Agreed... although of course I do not understand how this should work. But looks better in any case.. Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as well, this should not have any effect unless kthread does allow_signal(SIGKILL), then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure this is really possible. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f72.google.com (mail-vk0-f72.google.com [209.85.213.72]) by kanga.kvack.org (Postfix) with ESMTP id 7E1D46B0005 for ; Sun, 3 Jul 2016 11:30:17 -0400 (EDT) Received: by mail-vk0-f72.google.com with SMTP id i63so11292541vkb.1 for ; Sun, 03 Jul 2016 08:30:17 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id g188si1390253qkd.16.2016.07.03.08.30.16 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Jul 2016 08:30:16 -0700 (PDT) Date: Sun, 3 Jul 2016 18:30:11 +0300 From: "Michael S. Tsirkin" Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160703182254-mutt-send-email-mst@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703151829.GA28667@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Oleg Nesterov Cc: Michal Hocko , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov , Michal Hocko On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote: > On 07/03, Michael S. Tsirkin wrote: > > > > On Sun, Jul 03, 2016 at 03:47:19PM +0200, Oleg Nesterov wrote: > > > On 07/01, Michal Hocko wrote: > > > > > > > > From: Michal Hocko > > > > > > > > vhost driver relies on copy_from_user/get_user from a kernel thread. > > > > This makes it impossible to reap the memory of an oom victim which > > > > shares mm with the vhost kernel thread because it could see a zero > > > > page unexpectedly and theoretically make an incorrect decision visible > > > > outside of the killed task context. > > > > > > And I still can't understand how, but let me repeat that I don't understand > > > this code at all. > > > > > > > To quote Michael S. Tsirkin: > > > > : Getting an error from __get_user and friends is handled gracefully. > > > > : Getting zero instead of a real value will cause userspace > > > > : memory corruption. > > > > > > Which userspace memory corruption? We are going to kill the dev->mm owner, > > > the task which did ioctl(VHOST_SET_OWNER) and (at first glance) the task > > > who communicates with the callbacks fired by vhost_worker(). > > > > > > Michael, could you please spell why should we care? > > > > I am concerned that > > - oom victim is sharing memory with another task > > - getting incorrect value from ring read makes vhost > > change that shared memory > > Well, we are going to kill all tasks which share this memory. I mean, ->mm. > If "sharing memory with another task" means, say, a file, then this memory > won't be unmapped (if shared). > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted? As you say, I mean anyone who shares memory with QEMU through a file. IIUC current users that do this are all stateless so even if they crash this is not a big deal, but it seems wrong to assume this will be like this forever. > Sorry, I simply do not know what vhost does, quite possibly a stupid question. > > > Having said all that, how about we just add some kind of per-mm > > notifier list, and let vhost know that owner is going away so > > it should stop looking at memory? > > > > Seems cleaner than looking at flags at each memory access, > > since vhost has its own locking. > > Agreed... although of course I do not understand how this should work. Add a linked list of callbacks in in struct mm_struct. vhost would add itself there. In callback, set private_data for all vqs to NULL under vq mutex. > But > looks better in any case.. > > Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as > well, this should not have any effect unless kthread does allow_signal(SIGKILL), > then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure > this is really possible. > > Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f70.google.com (mail-vk0-f70.google.com [209.85.213.70]) by kanga.kvack.org (Postfix) with ESMTP id 7E74F6B0005 for ; Sun, 3 Jul 2016 12:47:39 -0400 (EDT) Received: by mail-vk0-f70.google.com with SMTP id i63so14492532vkb.1 for ; Sun, 03 Jul 2016 09:47:39 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q25si2110573qtq.57.2016.07.03.09.47.38 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Jul 2016 09:47:38 -0700 (PDT) Date: Sun, 3 Jul 2016 18:47:23 +0200 From: Oleg Nesterov Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160703164723.GA30151@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160703182254-mutt-send-email-mst@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703182254-mutt-send-email-mst@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael S. Tsirkin" Cc: Michal Hocko , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov , Michal Hocko On 07/03, Michael S. Tsirkin wrote: > > On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote: > > > > Well, we are going to kill all tasks which share this memory. I mean, ->mm. > > If "sharing memory with another task" means, say, a file, then this memory > > won't be unmapped (if shared). > > > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we > > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted? > > As you say, I mean anyone who shares memory with QEMU through a file. And in this case vhost_worker() reads the anonymous memory of QEMU process, not the memory which can be shared with another task, correct? And if QEMU simply crashes, this can't affect anyone who shares memory with QEMU through a file, yes? Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f70.google.com (mail-vk0-f70.google.com [209.85.213.70]) by kanga.kvack.org (Postfix) with ESMTP id 1A40C6B0005 for ; Sun, 3 Jul 2016 17:18:02 -0400 (EDT) Received: by mail-vk0-f70.google.com with SMTP id v6so415369171vkb.2 for ; Sun, 03 Jul 2016 14:18:02 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q72si66272qka.110.2016.07.03.14.18.01 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 03 Jul 2016 14:18:01 -0700 (PDT) Date: Mon, 4 Jul 2016 00:17:55 +0300 From: "Michael S. Tsirkin" Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160703215250-mutt-send-email-mst@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160703182254-mutt-send-email-mst@redhat.com> <20160703164723.GA30151@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703164723.GA30151@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Oleg Nesterov Cc: Michal Hocko , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov , Michal Hocko On Sun, Jul 03, 2016 at 06:47:23PM +0200, Oleg Nesterov wrote: > On 07/03, Michael S. Tsirkin wrote: > > > > On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote: > > > > > > Well, we are going to kill all tasks which share this memory. I mean, ->mm. > > > If "sharing memory with another task" means, say, a file, then this memory > > > won't be unmapped (if shared). > > > > > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we > > > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted? > > > > As you say, I mean anyone who shares memory with QEMU through a file. > > And in this case vhost_worker() reads the anonymous memory of QEMU process, > not the memory which can be shared with another task, correct? > > And if QEMU simply crashes, this can't affect anyone who shares memory with > QEMU through a file, yes? > > Oleg. Well no - the VM memory is not always anonymous memory. It can be an mmaped file. -- MST -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id BA5656B0253 for ; Thu, 7 Jul 2016 04:24:34 -0400 (EDT) Received: by mail-wm0-f71.google.com with SMTP id i4so11319258wmg.2 for ; Thu, 07 Jul 2016 01:24:34 -0700 (PDT) Received: from mail-wm0-f67.google.com (mail-wm0-f67.google.com. [74.125.82.67]) by mx.google.com with ESMTPS id o64si3951478wmb.29.2016.07.07.01.24.33 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Jul 2016 01:24:33 -0700 (PDT) Received: by mail-wm0-f67.google.com with SMTP id 187so3698840wmz.1 for ; Thu, 07 Jul 2016 01:24:33 -0700 (PDT) Date: Thu, 7 Jul 2016 10:24:31 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 1/6] oom: keep mm of the killed task available Message-ID: <20160707082431.GB5379@dhcp22.suse.cz> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-2-git-send-email-mhocko@kernel.org> <201607031145.HIF90125.LMHQVFJOtOSOFF@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607031145.HIF90125.LMHQVFJOtOSOFF@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com On Sun 03-07-16 11:45:34, Tetsuo Handa wrote: > Michal Hocko wrote: > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 7d0a275df822..4ea4a649822d 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > > * Don't allow any other task to have access to the reserves unless > > * the task has MMF_OOM_REAPED because chances that it would release > > * any memory is quite low. > > + * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time > > + * so let it try again. > > */ > > if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { > > - struct task_struct *p = find_lock_task_mm(task); > > + struct mm_struct *mm = task->signal->oom_mm; > > enum oom_scan_t ret = OOM_SCAN_ABORT; > > > > - if (p) { > > - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) > > - ret = OOM_SCAN_CONTINUE; > > - task_unlock(p); > > - } > > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > > + ret = OOM_SCAN_CONTINUE; > > + else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) > > + ret = OOM_SCAN_SELECT; > > I don't think this is useful. Well, to be honest me neither but changing the retry logic is not in scope of this patch. It just preserved the existing logic. I guess we can get rid of it but that deserves a separate patch. The retry was implemented to cover unlikely stalls when the lock is held but as this hasn't ever been observed in the real life I would agree to remove it to simplify the code (even though it is literally few lines of code). I was probably overcautious when adding the flag. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id DFB946B0253 for ; Thu, 7 Jul 2016 04:28:14 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id a66so12893957wme.1 for ; Thu, 07 Jul 2016 01:28:14 -0700 (PDT) Received: from mail-wm0-f50.google.com (mail-wm0-f50.google.com. [74.125.82.50]) by mx.google.com with ESMTPS id m2si4660518wme.45.2016.07.07.01.28.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Jul 2016 01:28:13 -0700 (PDT) Received: by mail-wm0-f50.google.com with SMTP id f126so200922796wma.1 for ; Thu, 07 Jul 2016 01:28:13 -0700 (PDT) Date: Thu, 7 Jul 2016 10:28:12 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160707082811.GC5379@dhcp22.suse.cz> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160703182254-mutt-send-email-mst@redhat.com> <20160703164723.GA30151@redhat.com> <20160703215250-mutt-send-email-mst@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703215250-mutt-send-email-mst@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael S. Tsirkin" Cc: Oleg Nesterov , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On Mon 04-07-16 00:17:55, Michael S. Tsirkin wrote: > On Sun, Jul 03, 2016 at 06:47:23PM +0200, Oleg Nesterov wrote: > > On 07/03, Michael S. Tsirkin wrote: > > > > > > On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote: > > > > > > > > Well, we are going to kill all tasks which share this memory. I mean, ->mm. > > > > If "sharing memory with another task" means, say, a file, then this memory > > > > won't be unmapped (if shared). > > > > > > > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we > > > > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted? > > > > > > As you say, I mean anyone who shares memory with QEMU through a file. > > > > And in this case vhost_worker() reads the anonymous memory of QEMU process, > > not the memory which can be shared with another task, correct? > > > > And if QEMU simply crashes, this can't affect anyone who shares memory with > > QEMU through a file, yes? > > > > Oleg. > > Well no - the VM memory is not always anonymous memory. It can be an > mmaped file. Just to make sure we are all at the same page. I guess the scenario is as follows. The owner of the mm has ring and other statefull information in the private memory but consumers living with their own mm consume some data from a shared memory segments (e.g. files). The worker would misinterpret statefull information (zeros rather than the original content) and would copy invalid/corrupted data to the consumer. Am I correct? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f200.google.com (mail-ob0-f200.google.com [209.85.214.200]) by kanga.kvack.org (Postfix) with ESMTP id 9504B6B0253 for ; Thu, 7 Jul 2016 04:39:35 -0400 (EDT) Received: by mail-ob0-f200.google.com with SMTP id fq2so20695059obb.2 for ; Thu, 07 Jul 2016 01:39:35 -0700 (PDT) Received: from mail-wm0-f68.google.com (mail-wm0-f68.google.com. [74.125.82.68]) by mx.google.com with ESMTPS id v131si1842296wmf.1.2016.07.07.01.39.34 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Jul 2016 01:39:34 -0700 (PDT) Received: by mail-wm0-f68.google.com with SMTP id i4so3325470wmg.3 for ; Thu, 07 Jul 2016 01:39:34 -0700 (PDT) Date: Thu, 7 Jul 2016 10:39:33 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160707083932.GD5379@dhcp22.suse.cz> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703140904.GA26908@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael S. Tsirkin" Cc: Oleg Nesterov , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On Sun 03-07-16 17:09:04, Michael S. Tsirkin wrote: [...] > Having said all that, how about we just add some kind of per-mm > notifier list, and let vhost know that owner is going away so > it should stop looking at memory? But this would have to be a synchronous operation from the oom killer, no? I would really like to reduce the number of external dependencies from the oom killer paths as much as possible. This is the whole point of these patches. If we have a notification mechanism, what would guarantee that the oom killer would make a forward progress if the notified end cannot continue (wait for a lock etc...)? I do realize that a test per each memory access is not welcome that much. An alternative would be to hook the check into the page fault handler because then the overhead would be reduced only to the slowpath (from the copy_from_user POV). But then also non use_mm users would have to pay the price which is even less attractive. Another alternative would be disabling pagefaults when copying from the userspace. This would require that the memory is prefault when used which might be a problem for the current implementation. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id C00EF6B0253 for ; Thu, 7 Jul 2016 04:42:02 -0400 (EDT) Received: by mail-wm0-f71.google.com with SMTP id a66so13142971wme.1 for ; Thu, 07 Jul 2016 01:42:02 -0700 (PDT) Received: from mail-wm0-f66.google.com (mail-wm0-f66.google.com. [74.125.82.66]) by mx.google.com with ESMTPS id p6si1075805wjx.285.2016.07.07.01.42.01 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Jul 2016 01:42:01 -0700 (PDT) Received: by mail-wm0-f66.google.com with SMTP id i4so3340422wmg.3 for ; Thu, 07 Jul 2016 01:42:01 -0700 (PDT) Date: Thu, 7 Jul 2016 10:42:00 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160707084159.GE5379@dhcp22.suse.cz> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160703151829.GA28667@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Oleg Nesterov Cc: "Michael S. Tsirkin" , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On Sun 03-07-16 17:18:29, Oleg Nesterov wrote: [...] > Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as > well, this should not have any effect unless kthread does allow_signal(SIGKILL), > then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure > this is really possible. But then we would have to check for the signal after every memory access no? This sounds much more error prone than the test being wrapped inside the copy_from... API to me. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f71.google.com (mail-it0-f71.google.com [209.85.214.71]) by kanga.kvack.org (Postfix) with ESMTP id 24C096B0253 for ; Thu, 7 Jul 2016 07:48:57 -0400 (EDT) Received: by mail-it0-f71.google.com with SMTP id g8so47415036itb.2 for ; Thu, 07 Jul 2016 04:48:57 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id v127si2589045itg.48.2016.07.07.04.48.55 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 07 Jul 2016 04:48:56 -0700 (PDT) Subject: Re: [RFC PATCH 1/6] oom: keep mm of the killed task available From: Tetsuo Handa References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-2-git-send-email-mhocko@kernel.org> <201607031145.HIF90125.LMHQVFJOtOSOFF@I-love.SAKURA.ne.jp> <20160707082431.GB5379@dhcp22.suse.cz> In-Reply-To: <20160707082431.GB5379@dhcp22.suse.cz> Message-Id: <201607072048.JBE13074.FSOJVHLOFFMOtQ@I-love.SAKURA.ne.jp> Date: Thu, 7 Jul 2016 20:48:46 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: mhocko@kernel.org Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com Michal Hocko wrote: > On Sun 03-07-16 11:45:34, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > index 7d0a275df822..4ea4a649822d 100644 > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > > > * Don't allow any other task to have access to the reserves unless > > > * the task has MMF_OOM_REAPED because chances that it would release > > > * any memory is quite low. > > > + * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time > > > + * so let it try again. > > > */ > > > if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { > > > - struct task_struct *p = find_lock_task_mm(task); > > > + struct mm_struct *mm = task->signal->oom_mm; > > > enum oom_scan_t ret = OOM_SCAN_ABORT; > > > > > > - if (p) { > > > - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) > > > - ret = OOM_SCAN_CONTINUE; > > > - task_unlock(p); > > > - } > > > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > > > + ret = OOM_SCAN_CONTINUE; > > > + else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) > > > + ret = OOM_SCAN_SELECT; > > > > I don't think this is useful. > > Well, to be honest me neither but changing the retry logic is not in > scope of this patch. It just preserved the existing logic. I guess we > can get rid of it but that deserves a separate patch. The retry was > implemented to cover unlikely stalls when the lock is held but as this > hasn't ever been observed in the real life I would agree to remove it to > simplify the code (even though it is literally few lines of code). I was > probably overcautious when adding the flag. > You mean reverting http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org ? If we hit a situation where MMF_OOM_NOT_REAPABLE is set, it means that that mm was used by multiple threads and one of them is blocked. On the other hand, since currently task_struct->oom_reaper_list is used, we can hit (say, T1 and T2 and T3 are sharing the same mm) (1) The T1's mm is queued to oom_reaper_list for the first time by T1. (2) The OOM reaper finds that mm for the first time. (3) The OOM reaper fails to hold mm->mmap_sem for read because T3 is blocked with that mm->mmap_sem held for write. (4) The T2's mm (which is same with T1's mm) is queued to oom_reaper_list for the second time by T2. (5) The OOM reaper still fails to hold mm->mmap_sem for read because T3 is blocked with that mm->mmap_sem held for write. (6) The OOM reaper sets MMF_OOM_NOT_REAPABLE. (7) That mm is dequeued from oom_reaper_list for the first time by the OOM reaper. (8) The OOM reaper finds that mm for the second time. (9) The OOM reaper still fails to hold mm->mmap_sem for read because T3 is blocked with that mm->mmap_sem held for write. (10) The OOM reaper sets MMF_OOM_REAPED. (11) That mm is dequeued from oom_reaper_list for the second time by the OOM reaper. sequences. To me, MMF_OOM_NOT_REAPABLE alone is unlikely helpful. If oom_mm_list list which chains mm_struct is used, at least we won't concurrently queue same mm which is currently under OOM reaper's operation. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id 9EA3E6B0253 for ; Thu, 7 Jul 2016 09:33:02 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id r190so18627519wmr.0 for ; Thu, 07 Jul 2016 06:33:02 -0700 (PDT) Received: from mail-lf0-f66.google.com (mail-lf0-f66.google.com. [209.85.215.66]) by mx.google.com with ESMTPS id s2si2588672lfs.401.2016.07.07.06.33.01 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Jul 2016 06:33:01 -0700 (PDT) Received: by mail-lf0-f66.google.com with SMTP id w130so1458115lfd.2 for ; Thu, 07 Jul 2016 06:33:01 -0700 (PDT) Date: Thu, 7 Jul 2016 15:32:59 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 1/6] oom: keep mm of the killed task available Message-ID: <20160707133259.GL5379@dhcp22.suse.cz> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-2-git-send-email-mhocko@kernel.org> <201607031145.HIF90125.LMHQVFJOtOSOFF@I-love.SAKURA.ne.jp> <20160707082431.GB5379@dhcp22.suse.cz> <201607072048.JBE13074.FSOJVHLOFFMOtQ@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201607072048.JBE13074.FSOJVHLOFFMOtQ@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, akpm@linux-foundation.org, oleg@redhat.com, rientjes@google.com, vdavydov@parallels.com On Thu 07-07-16 20:48:46, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Sun 03-07-16 11:45:34, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > > index 7d0a275df822..4ea4a649822d 100644 > > > > --- a/mm/oom_kill.c > > > > +++ b/mm/oom_kill.c > > > > @@ -286,16 +286,17 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > > > > * Don't allow any other task to have access to the reserves unless > > > > * the task has MMF_OOM_REAPED because chances that it would release > > > > * any memory is quite low. > > > > + * MMF_OOM_NOT_REAPABLE means that the oom_reaper backed off last time > > > > + * so let it try again. > > > > */ > > > > if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { > > > > - struct task_struct *p = find_lock_task_mm(task); > > > > + struct mm_struct *mm = task->signal->oom_mm; > > > > enum oom_scan_t ret = OOM_SCAN_ABORT; > > > > > > > > - if (p) { > > > > - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) > > > > - ret = OOM_SCAN_CONTINUE; > > > > - task_unlock(p); > > > > - } > > > > + if (test_bit(MMF_OOM_REAPED, &mm->flags)) > > > > + ret = OOM_SCAN_CONTINUE; > > > > + else if (test_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) > > > > + ret = OOM_SCAN_SELECT; > > > > > > I don't think this is useful. > > > > Well, to be honest me neither but changing the retry logic is not in > > scope of this patch. It just preserved the existing logic. I guess we > > can get rid of it but that deserves a separate patch. The retry was > > implemented to cover unlikely stalls when the lock is held but as this > > hasn't ever been observed in the real life I would agree to remove it to > > simplify the code (even though it is literally few lines of code). I was > > probably overcautious when adding the flag. > > > > You mean reverting http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org ? Yes, asuming that MMF_OOM_REAPED is set in that case of course. [Skipping the rest as this is not related to this patch.] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f198.google.com (mail-qk0-f198.google.com [209.85.220.198]) by kanga.kvack.org (Postfix) with ESMTP id 0ABEC6B025E for ; Thu, 7 Jul 2016 11:39:00 -0400 (EDT) Received: by mail-qk0-f198.google.com with SMTP id c185so43411794qkd.1 for ; Thu, 07 Jul 2016 08:39:00 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id m185si2905313qkd.303.2016.07.07.08.38.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Jul 2016 08:38:59 -0700 (PDT) Date: Thu, 7 Jul 2016 18:38:52 +0300 From: "Michael S. Tsirkin" Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160707183848-mutt-send-email-mst@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160703182254-mutt-send-email-mst@redhat.com> <20160703164723.GA30151@redhat.com> <20160703215250-mutt-send-email-mst@redhat.com> <20160707082811.GC5379@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160707082811.GC5379@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Oleg Nesterov , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote: > On Mon 04-07-16 00:17:55, Michael S. Tsirkin wrote: > > On Sun, Jul 03, 2016 at 06:47:23PM +0200, Oleg Nesterov wrote: > > > On 07/03, Michael S. Tsirkin wrote: > > > > > > > > On Sun, Jul 03, 2016 at 05:18:29PM +0200, Oleg Nesterov wrote: > > > > > > > > > > Well, we are going to kill all tasks which share this memory. I mean, ->mm. > > > > > If "sharing memory with another task" means, say, a file, then this memory > > > > > won't be unmapped (if shared). > > > > > > > > > > So let me ask again... Suppose, say, QEMU does VHOST_SET_OWNER and then we > > > > > unmap its (anonymous/non-shared) memory. Who else's memory can be corrupted? > > > > > > > > As you say, I mean anyone who shares memory with QEMU through a file. > > > > > > And in this case vhost_worker() reads the anonymous memory of QEMU process, > > > not the memory which can be shared with another task, correct? > > > > > > And if QEMU simply crashes, this can't affect anyone who shares memory with > > > QEMU through a file, yes? > > > > > > Oleg. > > > > Well no - the VM memory is not always anonymous memory. It can be an > > mmaped file. > > Just to make sure we are all at the same page. I guess the scenario is > as follows. The owner of the mm has ring and other statefull information > in the private memory but consumers living with their own mm consume > some data from a shared memory segments (e.g. files). The worker would > misinterpret statefull information (zeros rather than the original > content) and would copy invalid/corrupted data to the consumer. Am I > correct? > > -- > Michal Hocko > SUSE Labs Exactly. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f199.google.com (mail-qk0-f199.google.com [209.85.220.199]) by kanga.kvack.org (Postfix) with ESMTP id 8DEC16B0253 for ; Thu, 7 Jul 2016 12:46:08 -0400 (EDT) Received: by mail-qk0-f199.google.com with SMTP id c185so46806956qkd.1 for ; Thu, 07 Jul 2016 09:46:08 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id t187si2222439qkh.130.2016.07.07.09.46.07 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Jul 2016 09:46:07 -0700 (PDT) Date: Thu, 7 Jul 2016 18:46:02 +0200 From: Oleg Nesterov Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160707164602.GC3063@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160707084159.GE5379@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160707084159.GE5379@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: "Michael S. Tsirkin" , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On 07/07, Michal Hocko wrote: > > On Sun 03-07-16 17:18:29, Oleg Nesterov wrote: > [...] > > Or perhaps we can change oom_kill_process() to send SIGKILL to kthreads as > > well, this should not have any effect unless kthread does allow_signal(SIGKILL), > > then we can change vhost_worker() to catch SIGKILL and react somehow. Not sure > > this is really possible. > > But then we would have to check for the signal after every memory > access no? This sounds much more error prone than the test being wrapped > inside the copy_from... API to me. At least I agree this doesn't look nice too. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f70.google.com (mail-vk0-f70.google.com [209.85.213.70]) by kanga.kvack.org (Postfix) with ESMTP id 4844E6B0260 for ; Fri, 8 Jul 2016 08:29:53 -0400 (EDT) Received: by mail-vk0-f70.google.com with SMTP id v6so90752668vkb.2 for ; Fri, 08 Jul 2016 05:29:53 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q18si898193qte.36.2016.07.08.05.29.52 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Jul 2016 05:29:52 -0700 (PDT) Date: Fri, 8 Jul 2016 14:29:48 +0200 From: Oleg Nesterov Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160708122948.GA4733@redhat.com> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160703182254-mutt-send-email-mst@redhat.com> <20160703164723.GA30151@redhat.com> <20160703215250-mutt-send-email-mst@redhat.com> <20160707082811.GC5379@dhcp22.suse.cz> <20160707183848-mutt-send-email-mst@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160707183848-mutt-send-email-mst@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael S. Tsirkin" Cc: Michal Hocko , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On 07/07, Michael S. Tsirkin wrote: > > On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote: > > > > Just to make sure we are all at the same page. I guess the scenario is > > as follows. The owner of the mm has ring and other statefull information > > in the private memory but consumers living with their own mm consume > > some data from a shared memory segments (e.g. files). The worker would > > misinterpret statefull information (zeros rather than the original > > content) and would copy invalid/corrupted data to the consumer. Am I > > correct? > > Exactly. Michael, let me ask again. But what if we simply kill the owner of this mm? Yes, if we dont't unmap its memory then vhost_worker() can't read the wrong zero from anonymous vma. But the killed process obviously won't be able to update this statefull info after that, it will be frozen. Are you saying this can't affect other apps which share the memory with the (killed) mm owner? IOW. If we kill a process, this can affect other applications anyway. Just for example, suppose that this process takes a non-robust futex in the shared memory segment. After that other users of this futex will hang forever. So do you think that this particular "vhost" problem is really worse and we must specialy avoid it? If yes, note that this means that any process which can do VHOST_SET_OWNER becomes "oom-unkillable" to some degree, and this doesn't look right. It can spawn another CLONE_FILES process and this will block fops->release() which (iiuc) should stop the kernel thread which pins the memory hog's memory. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f69.google.com (mail-lf0-f69.google.com [209.85.215.69]) by kanga.kvack.org (Postfix) with ESMTP id 8A1A86B025E for ; Mon, 11 Jul 2016 10:14:30 -0400 (EDT) Received: by mail-lf0-f69.google.com with SMTP id g18so71569822lfg.2 for ; Mon, 11 Jul 2016 07:14:30 -0700 (PDT) Received: from mail-wm0-f51.google.com (mail-wm0-f51.google.com. [74.125.82.51]) by mx.google.com with ESMTPS id p10si38632wjp.50.2016.07.11.07.14.29 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 11 Jul 2016 07:14:29 -0700 (PDT) Received: by mail-wm0-f51.google.com with SMTP id o80so52747210wme.1 for ; Mon, 11 Jul 2016 07:14:29 -0700 (PDT) Date: Mon, 11 Jul 2016 16:14:27 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160711141427.GM1811@dhcp22.suse.cz> References: <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160703182254-mutt-send-email-mst@redhat.com> <20160703164723.GA30151@redhat.com> <20160703215250-mutt-send-email-mst@redhat.com> <20160707082811.GC5379@dhcp22.suse.cz> <20160707183848-mutt-send-email-mst@redhat.com> <20160708122948.GA4733@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160708122948.GA4733@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Oleg Nesterov Cc: "Michael S. Tsirkin" , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On Fri 08-07-16 14:29:48, Oleg Nesterov wrote: > On 07/07, Michael S. Tsirkin wrote: > > > > On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote: > > > > > > Just to make sure we are all at the same page. I guess the scenario is > > > as follows. The owner of the mm has ring and other statefull information > > > in the private memory but consumers living with their own mm consume > > > some data from a shared memory segments (e.g. files). The worker would > > > misinterpret statefull information (zeros rather than the original > > > content) and would copy invalid/corrupted data to the consumer. Am I > > > correct? > > > > Exactly. > > Michael, let me ask again. > > But what if we simply kill the owner of this mm? I might be wrong here but the mm owner doesn't really matter AFAIU. It is the holder of the file descriptor for the "device" who control all the actions, no? The fact that it hijacked the mm along the way is hiden from users. If you kill the owner but pass the fd somewhere else then the mm will live as long as the fd. [...] > If yes, note that this means that any process which can do VHOST_SET_OWNER becomes > "oom-unkillable" to some degree, and this doesn't look right. It can spawn another > CLONE_FILES process and this will block fops->release() which (iiuc) should stop > the kernel thread which pins the memory hog's memory. I believe this is indeed possible. It can even pass the fd to a different process and keep it alive, hidden from the oom killer causing other processes to be killed. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f199.google.com (mail-qt0-f199.google.com [209.85.216.199]) by kanga.kvack.org (Postfix) with ESMTP id B56B36B025F for ; Tue, 12 Jul 2016 10:33:26 -0400 (EDT) Received: by mail-qt0-f199.google.com with SMTP id b35so33070585qta.0 for ; Tue, 12 Jul 2016 07:33:26 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q43si1601773qta.58.2016.07.12.07.33.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 12 Jul 2016 07:33:25 -0700 (PDT) Date: Tue, 12 Jul 2016 16:33:47 +0200 From: Oleg Nesterov Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160712143346.GC28837@redhat.com> References: <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160703151829.GA28667@redhat.com> <20160703182254-mutt-send-email-mst@redhat.com> <20160703164723.GA30151@redhat.com> <20160703215250-mutt-send-email-mst@redhat.com> <20160707082811.GC5379@dhcp22.suse.cz> <20160707183848-mutt-send-email-mst@redhat.com> <20160708122948.GA4733@redhat.com> <20160711141427.GM1811@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160711141427.GM1811@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: "Michael S. Tsirkin" , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On 07/11, Michal Hocko wrote: > > On Fri 08-07-16 14:29:48, Oleg Nesterov wrote: > > On 07/07, Michael S. Tsirkin wrote: > > > > > > On Thu, Jul 07, 2016 at 10:28:12AM +0200, Michal Hocko wrote: > > > > > > > > Just to make sure we are all at the same page. I guess the scenario is > > > > as follows. The owner of the mm has ring and other statefull information > > > > in the private memory but consumers living with their own mm consume > > > > some data from a shared memory segments (e.g. files). The worker would > > > > misinterpret statefull information (zeros rather than the original > > > > content) and would copy invalid/corrupted data to the consumer. Am I > > > > correct? > > > > > > Exactly. > > > > Michael, let me ask again. > > > > But what if we simply kill the owner of this mm? > > I might be wrong here but the mm owner doesn't really matter AFAIU. It > is the holder of the file descriptor for the "device" who control all > the actions, no? The fact that it hijacked the mm along the way is hiden > from users. If you kill the owner but pass the fd somewhere else then > the mm will live as long as the fd. Of course. I meant that qemu/guest won't update that statefull info in its anonymous memory after we kill it. And I have no idea if it is fine or not. As I said, I do not even know what drivers/vhost actually does, and probably that is why I do not understand why this particular problem (bogus zeroes in anonymous memory) is worse than other problems we can't avoid anyway when we kill the victim and this affects other applications. > > If yes, note that this means that any process which can do VHOST_SET_OWNER becomes > > "oom-unkillable" to some degree, and this doesn't look right. It can spawn another > > CLONE_FILES process and this will block fops->release() which (iiuc) should stop > > the kernel thread which pins the memory hog's memory. > > I believe this is indeed possible. It can even pass the fd to a > different process and keep it alive, hidden from the oom killer causing > other processes to be killed. Yes, so I think we should unmap the memory even if it is used by kthread. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f69.google.com (mail-lf0-f69.google.com [209.85.215.69]) by kanga.kvack.org (Postfix) with ESMTP id 6E3506B0005 for ; Fri, 22 Jul 2016 07:09:09 -0400 (EDT) Received: by mail-lf0-f69.google.com with SMTP id l89so70466151lfi.3 for ; Fri, 22 Jul 2016 04:09:09 -0700 (PDT) Received: from mail-wm0-f66.google.com (mail-wm0-f66.google.com. [74.125.82.66]) by mx.google.com with ESMTPS id s2si359559wjc.195.2016.07.22.04.09.08 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 22 Jul 2016 04:09:08 -0700 (PDT) Received: by mail-wm0-f66.google.com with SMTP id i5so5730750wmg.2 for ; Fri, 22 Jul 2016 04:09:08 -0700 (PDT) Date: Fri, 22 Jul 2016 13:09:06 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 5/6] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost Message-ID: <20160722110906.GI794@dhcp22.suse.cz> References: <1467365190-24640-1-git-send-email-mhocko@kernel.org> <1467365190-24640-6-git-send-email-mhocko@kernel.org> <20160703134719.GA28492@redhat.com> <20160703140904.GA26908@redhat.com> <20160707083932.GD5379@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160707083932.GD5379@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael S. Tsirkin" Cc: Oleg Nesterov , linux-mm@kvack.org, Andrew Morton , Tetsuo Handa , David Rientjes , Vladimir Davydov On Thu 07-07-16 10:39:32, Michal Hocko wrote: > On Sun 03-07-16 17:09:04, Michael S. Tsirkin wrote: > [...] > > Having said all that, how about we just add some kind of per-mm > > notifier list, and let vhost know that owner is going away so > > it should stop looking at memory? > > But this would have to be a synchronous operation from the oom killer, > no? I would really like to reduce the number of external dependencies > from the oom killer paths as much as possible. This is the whole point > of these patches. If we have a notification mechanism, what would > guarantee that the oom killer would make a forward progress if the > notified end cannot continue (wait for a lock etc...)? > > I do realize that a test per each memory access is not welcome that > much. An alternative would be to hook the check into the page fault > handler because then the overhead would be reduced only to the slowpath > (from the copy_from_user POV). But then also non use_mm users would have > to pay the price which is even less attractive. > > Another alternative would be disabling pagefaults when copying from the > userspace. This would require that the memory is prefault when used > which might be a problem for the current implementation. ping Michael... I would like to pursue this again and have something for 4.9 ideally. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org