* [patch] mm, oom: fix concurrent munlock and oom reaper unmap @ 2018-04-17 22:46 David Rientjes 2018-04-18 0:57 ` Tetsuo Handa 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-17 22:46 UTC (permalink / raw) To: Andrew Morton Cc: Michal Hocko, Andrea Arcangeli, Tetsuo Handa, Roman Gushchin, linux-kernel, linux-mm Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires kick_all_cpus_sync(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl. Fix this by reusing MMF_UNSTABLE to specify that an mm should not be reaped. This prevents the concurrent munlock_vma_pages_range() and unmap_page_range(). The oom reaper will simply not operate on an mm that has the bit set and leave the unmapping to exit_mmap(). Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Cc: stable@vger.kernel.org [4.14+] Signed-off-by: David Rientjes <rientjes@google.com> --- mm/mmap.c | 38 ++++++++++++++++++++------------------ mm/oom_kill.c | 19 ++++++++----------- 2 files changed, 28 insertions(+), 29 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); + if (unlikely(mm_is_oom_victim(mm))) { + /* + * Wait for oom_reap_task() to stop working on this mm. 
Because + * MMF_UNSTABLE is already set before calling down_read(), + * oom_reap_task() will not run on this mm after up_write(). + * oom_reap_task() also depends on a stable VM_LOCKED flag to + * indicate it should not unmap during munlock_vma_pages_all(). + * + * mm_is_oom_victim() cannot be set from under us because + * victim->mm is already set to NULL under task_lock before + * calling mmput() and victim->signal->oom_mm is set by the oom + * killer only if victim->mm is non-NULL while holding + * task_lock(). + */ + set_bit(MMF_UNSTABLE, &mm->flags); + down_write(&mm->mmap_sem); + up_write(&mm->mmap_sem); + } + if (mm->locked_vm) { vma = mm->mmap; while (vma) { @@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm) /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - - if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because victim->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if victim->mm - * is found not NULL while holding the task_lock. - */ - set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); + set_bit(MMF_OOM_SKIP, &mm->flags); /* * Walk the list again, actually closing and freeing it, diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) } /* - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't - * work on the mm anymore. 
The check for MMF_OOM_SKIP must run + * Tell all users of get_user/copy_from_user etc... that the content + * is no longer stable. No barriers really needed because unmapping + * should imply barriers already and the reader would hit a page fault + * if it stumbled over reaped memory. + * + * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't + * work on the mm anymore. The check for MMF_OOM_UNSTABLE must run * under mmap_sem for reading because it serializes against the * down_write();up_write() cycle in exit_mmap(). */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) { up_read(&mm->mmap_sem); trace_skip_task_reaping(tsk->pid); goto unlock_oom; @@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) trace_start_task_reaping(tsk->pid); - /* - * Tell all users of get_user/copy_from_user etc... that the content - * is no longer stable. No barriers really needed because unmapping - * should imply barriers already and the reader would hit a page fault - * if it stumbled over a reaped memory. - */ - set_bit(MMF_UNSTABLE, &mm->flags); - for (vma = mm->mmap ; vma; vma = vma->vm_next) { if (!can_madv_dontneed_vma(vma)) continue; ^ permalink raw reply [flat|nested] 47+ messages in thread
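[Editorial note] The munlock race the patch above describes can be sketched as a toy model. This is illustrative Python, not kernel code: the dict fields and step functions are invented stand-ins for VM_LOCKED, the reaper's unmap, and munlock_vma_pages_range().

```python
def make_vma():
    # invented stand-in for a vma: the VM_LOCKED flag plus "is the range mapped"
    return {"vm_locked": True, "mapped": True}

def munlock_clear_flag(vma):
    # munlock_vma_pages_all() clears VM_LOCKED *before* munlocking, so it
    # can tell whether any other vma still locks the same pages
    vma["vm_locked"] = False

def oom_reaper_pass(vma):
    # the reaper skips VM_LOCKED vmas, but the check is racy: run between
    # the flag clear and the actual munlock, it unmaps still-mlocked pages
    if not vma["vm_locked"]:
        vma["mapped"] = False

def munlock_walk_pages(vma):
    # munlock_vma_pages_range() then walks the range via follow_page_mask();
    # in the real kernel a pmd zapped underneath it ends in a NULL ptl
    # dereference -- modeled here as an error
    if not vma["mapped"]:
        raise RuntimeError("walked an already-reaped range")
```

Run in the racy order (clear flag, reaper pass, page walk) this trips the error; serialized in either order it does not, which is the property the MMF_UNSTABLE handshake is meant to enforce.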
* Re: [patch] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-17 22:46 [patch] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes @ 2018-04-18 0:57 ` Tetsuo Handa 2018-04-18 2:39 ` David Rientjes 0 siblings, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-18 0:57 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Tetsuo Handa, Roman Gushchin, linux-kernel, linux-mm David Rientjes wrote: > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > possible for the oom reaper to concurrently operate on an mm until > MMF_OOM_SKIP is set. > > This allows munlock_vma_pages_all() to concurrently run while the oom > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > clearing VM_LOCKED from vm_flags before actually doing the munlock to > determine if any other vmas are locking the same memory, the check for > VM_LOCKED in the oom reaper is racy. > > This is especially noticeable on architectures such as powerpc where > clearing a huge pmd requires kick_all_cpus_sync(). If the pmd is zapped > by the oom reaper during follow_page_mask() after the check for pmd_none() > is bypassed, this ends up deferencing a NULL ptl. I don't know whether the explanation above is correct. Did you actually see a crash caused by this race? > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > reaped. This prevents the concurrent munlock_vma_pages_range() and > unmap_page_range(). The oom reaper will simply not operate on an mm that > has the bit set and leave the unmapping to exit_mmap(). But this patch is setting MMF_OOM_SKIP without reaping any memory as soon as MMF_UNSTABLE is set, which is the situation described in 212925802454: At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. 
If exit_mmap() does not wait for any pages and __oom_reap_task_mm() cannot handle mlock()ed pages, isn't it better to revert 212925802454 like I mentioned at https://patchwork.kernel.org/patch/10095661/ and let the OOM reaper reclaim as much as possible before setting MMF_OOM_SKIP? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 0:57 ` Tetsuo Handa @ 2018-04-18 2:39 ` David Rientjes 2018-04-18 2:52 ` [patch v2] " David Rientjes 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-18 2:39 UTC (permalink / raw) To: Tetsuo Handa Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Wed, 18 Apr 2018, Tetsuo Handa wrote: > > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > > possible for the oom reaper to concurrently operate on an mm until > > MMF_OOM_SKIP is set. > > > > This allows munlock_vma_pages_all() to concurrently run while the oom > > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > > clearing VM_LOCKED from vm_flags before actually doing the munlock to > > determine if any other vmas are locking the same memory, the check for > > VM_LOCKED in the oom reaper is racy. > > > > This is especially noticeable on architectures such as powerpc where > > clearing a huge pmd requires kick_all_cpus_sync(). If the pmd is zapped > > by the oom reaper during follow_page_mask() after the check for pmd_none() > > is bypassed, this ends up deferencing a NULL ptl. > > I don't know whether the explanation above is correct. > Did you actually see a crash caused by this race? > Yes, it's trivially reproducible on power by simply mlocking a ton of memory and triggering oom kill. > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > > reaped. This prevents the concurrent munlock_vma_pages_range() and > > unmap_page_range(). The oom reaper will simply not operate on an mm that > > has the bit set and leave the unmapping to exit_mmap(). 
> > But this patch is setting MMF_OOM_SKIP without reaping any memory as soon as > MMF_UNSTABLE is set, which is the situation described in 212925802454: > Oh, you're referring to __oom_reap_task_mm() returning true because of MMF_UNSTABLE and then setting MMF_OOM_SKIP itself? Yes, that is dumb. We could change __oom_reap_task_mm() to only set MMF_OOM_SKIP if MMF_UNSTABLE hasn't been set. I'll send a v2, which I needed to do anyway to do s/kick_all_cpus_sync/serialize_against_pte_lookup/ in the changelog (power only does it for the needed cpus). ^ permalink raw reply [flat|nested] 47+ messages in thread
* [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 2:39 ` David Rientjes @ 2018-04-18 2:52 ` David Rientjes 2018-04-18 3:55 ` Tetsuo Handa 2018-04-18 7:50 ` Michal Hocko 0 siblings, 2 replies; 47+ messages in thread From: David Rientjes @ 2018-04-18 2:52 UTC (permalink / raw) To: Andrew Morton, Tetsuo Handa Cc: Michal Hocko, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl. Fix this by reusing MMF_UNSTABLE to specify that an mm should not be reaped. This prevents the concurrent munlock_vma_pages_range() and unmap_page_range(). The oom reaper will simply not operate on an mm that has the bit set and leave the unmapping to exit_mmap(). 
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Cc: stable@vger.kernel.org [4.14+] Signed-off-by: David Rientjes <rientjes@google.com> --- v2: - oom reaper only sets MMF_OOM_SKIP if MMF_UNSTABLE was never set (either by itself or by exit_mmap(), per Tetsuo - s/kick_all_cpus_sync/serialize_against_pte_lookup/ in changelog as more isolated way of forcing cpus as non-idle on power mm/mmap.c | 38 ++++++++++++++++++++------------------ mm/oom_kill.c | 28 +++++++++++++--------------- 2 files changed, 33 insertions(+), 33 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); + if (unlikely(mm_is_oom_victim(mm))) { + /* + * Wait for oom_reap_task() to stop working on this mm. Because + * MMF_UNSTABLE is already set before calling down_read(), + * oom_reap_task() will not run on this mm after up_write(). + * oom_reap_task() also depends on a stable VM_LOCKED flag to + * indicate it should not unmap during munlock_vma_pages_all(). + * + * mm_is_oom_victim() cannot be set from under us because + * victim->mm is already set to NULL under task_lock before + * calling mmput() and victim->signal->oom_mm is set by the oom + * killer only if victim->mm is non-NULL while holding + * task_lock(). + */ + set_bit(MMF_UNSTABLE, &mm->flags); + down_write(&mm->mmap_sem); + up_write(&mm->mmap_sem); + } + if (mm->locked_vm) { vma = mm->mmap; while (vma) { @@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm) /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - - if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. 
Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because victim->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if victim->mm - * is found not NULL while holding the task_lock. - */ - set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); + set_bit(MMF_OOM_SKIP, &mm->flags); /* * Walk the list again, actually closing and freeing it, diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) } /* - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't - * work on the mm anymore. The check for MMF_OOM_SKIP must run + * Tell all users of get_user/copy_from_user etc... that the content + * is no longer stable. No barriers really needed because unmapping + * should imply barriers already and the reader would hit a page fault + * if it stumbled over reaped memory. + * + * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't + * work on the mm anymore. The check for MMF_OOM_UNSTABLE must run * under mmap_sem for reading because it serializes against the * down_write();up_write() cycle in exit_mmap(). */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) { up_read(&mm->mmap_sem); trace_skip_task_reaping(tsk->pid); goto unlock_oom; @@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) trace_start_task_reaping(tsk->pid); - /* - * Tell all users of get_user/copy_from_user etc... that the content - * is no longer stable. 
No barriers really needed because unmapping - * should imply barriers already and the reader would hit a page fault - * if it stumbled over a reaped memory. - */ - set_bit(MMF_UNSTABLE, &mm->flags); - for (vma = mm->mmap ; vma; vma = vma->vm_next) { if (!can_madv_dontneed_vma(vma)) continue; @@ -567,6 +564,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) tlb_finish_mmu(&tlb, start, end); } } + set_bit(MMF_OOM_SKIP, &mm->flags); pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm, K(get_mm_counter(mm, MM_ANONPAGES)), @@ -594,7 +592,6 @@ static void oom_reap_task(struct task_struct *tsk) test_bit(MMF_OOM_SKIP, &mm->flags)) goto done; - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); debug_show_all_locks(); @@ -603,10 +600,11 @@ static void oom_reap_task(struct task_struct *tsk) tsk->oom_reaper_list = NULL; /* - * Hide this mm from OOM killer because it has been either reaped or - * somebody can't call up_write(mmap_sem). + * If the oom reaper could not get started on this mm and it has not yet + * reached exit_mmap(), set MMF_OOM_SKIP to disregard. */ - set_bit(MMF_OOM_SKIP, &mm->flags); + if (!test_bit(MMF_UNSTABLE, &mm->flags)) + set_bit(MMF_OOM_SKIP, &mm->flags); /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); ^ permalink raw reply [flat|nested] 47+ messages in thread
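[Editorial note] Putting the v2 pieces together, the intended ordering can be modeled in userspace (a plain Lock stands in for mmap_sem, booleans for the flag bits; this is a sketch of the patch's protocol under those simplifications, not kernel code):

```python
import threading

class Mm:
    def __init__(self):
        self.mmap_sem = threading.Lock()  # write side only, for the model
        self.unstable = False             # MMF_UNSTABLE
        self.oom_skip = False             # MMF_OOM_SKIP

def oom_reap_task_mm(mm, log):
    with mm.mmap_sem:                     # down_read() in the kernel
        # test_and_set_bit(MMF_UNSTABLE): back off if exit_mmap() (or an
        # earlier reap) already claimed the mm
        if mm.unstable:
            return
        mm.unstable = True
        log.append("reaped")
    mm.oom_skip = True                    # set only when we actually reaped

def exit_mmap(mm, log):
    # set MMF_UNSTABLE *before* the empty lock cycle: a reaper already
    # inside its read section drains out, and later reapers see the bit
    mm.unstable = True
    mm.mmap_sem.acquire()                 # down_write();
    mm.mmap_sem.release()                 # up_write();
    log.append("munlock+unmap+free_pgtables")
    mm.oom_skip = True                    # now only after free_pgtables()
```

In either order, exactly one side does the unmapping, and MMF_OOM_SKIP is never set before some path has actually freed (or begun freeing) the memory.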
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 2:52 ` [patch v2] " David Rientjes @ 2018-04-18 3:55 ` Tetsuo Handa 2018-04-18 4:11 ` David Rientjes 2018-04-18 7:50 ` Michal Hocko 1 sibling, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-18 3:55 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm David Rientjes wrote: > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > reaped. This prevents the concurrent munlock_vma_pages_range() and > unmap_page_range(). The oom reaper will simply not operate on an mm that > has the bit set and leave the unmapping to exit_mmap(). This change assumes that munlock_vma_pages_all()/unmap_vmas()/free_pgtables() are never blocked for memory allocation. Is that guaranteed? For example, i_mmap_lock_write() from unmap_single_vma() from unmap_vmas() is never blocked for memory allocation? Commit 97b1255cb27c551d ("mm,oom_reaper: check for MMF_OOM_SKIP before complaining") was waiting for i_mmap_lock_write() from unlink_file_vma() from free_pgtables(). Is it really guaranteed that somebody else who is holding that lock is never waiting for memory allocation? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 3:55 ` Tetsuo Handa @ 2018-04-18 4:11 ` David Rientjes 2018-04-18 4:47 ` Tetsuo Handa 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-18 4:11 UTC (permalink / raw) To: Tetsuo Handa Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Wed, 18 Apr 2018, Tetsuo Handa wrote: > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > > reaped. This prevents the concurrent munlock_vma_pages_range() and > > unmap_page_range(). The oom reaper will simply not operate on an mm that > > has the bit set and leave the unmapping to exit_mmap(). > > This change assumes that munlock_vma_pages_all()/unmap_vmas()/free_pgtables() > are never blocked for memory allocation. Is that guaranteed? For example, > i_mmap_lock_write() from unmap_single_vma() from unmap_vmas() is never blocked > for memory allocation? Commit 97b1255cb27c551d ("mm,oom_reaper: check for > MMF_OOM_SKIP before complaining") was waiting for i_mmap_lock_write() from > unlink_file_vma() from free_pgtables(). Is it really guaranteed that somebody > else who is holding that lock is never waiting for memory allocation? > Commit 97b1255cb27c is referencing MMF_OOM_SKIP already being set by exit_mmap(). The only thing this patch changes is where that is done: before or after free_pgtables(). We can certainly move it to before free_pgtables() at the risk of subsequent (and eventually unnecessary) oom kills. It's not exactly the point of this patch. I have thousands of real-world examples where additional processes were oom killed while the original victim was in free_pgtables(). That's why we've moved the MMF_OOM_SKIP to after free_pgtables(). I'm not sure how likely your scenario is in the real world, but if it poses a problem then I believe it should be fixed by eventually deferring previous victims as a change to oom_evaluate_task(), not exit_mmap(). 
If you'd like me to fix that, please send along your test case that triggers it and I will send a patch. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 4:11 ` David Rientjes @ 2018-04-18 4:47 ` Tetsuo Handa 2018-04-18 5:20 ` David Rientjes 0 siblings, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-18 4:47 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm David Rientjes wrote: > On Wed, 18 Apr 2018, Tetsuo Handa wrote: > > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > > > reaped. This prevents the concurrent munlock_vma_pages_range() and > > > unmap_page_range(). The oom reaper will simply not operate on an mm that > > > has the bit set and leave the unmapping to exit_mmap(). > > > > This change assumes that munlock_vma_pages_all()/unmap_vmas()/free_pgtables() > > are never blocked for memory allocation. Is that guaranteed? For example, > > i_mmap_lock_write() from unmap_single_vma() from unmap_vmas() is never blocked > > for memory allocation? Commit 97b1255cb27c551d ("mm,oom_reaper: check for > > MMF_OOM_SKIP before complaining") was waiting for i_mmap_lock_write() from > > unlink_file_vma() from free_pgtables(). Is it really guaranteed that somebody > > else who is holding that lock is never waiting for memory allocation? > > > > Commit 97b1255cb27c is referencing MMF_OOM_SKIP already being set by > exit_mmap(). The only thing this patch changes is where that is done: > before or after free_pgtables(). We can certainly move it to before > free_pgtables() at the risk of subsequent (and eventually unnecessary) oom > kills. It's not exactly the point of this patch. > > I have thousands of real-world examples where additional processes were > oom killed while the original victim was in free_pgtables(). That's why > we've moved the MMF_OOM_SKIP to after free_pgtables(). "we have moved"? No, not yet. Your patch is about to move it. 
My question is: is it guaranteed that munlock_vma_pages_all()/unmap_vmas()/free_pgtables() by exit_mmap() are never blocked for memory allocation? Note that exit_mmap() tries to unmap all pages while the OOM reaper tries to unmap only safe pages. If there is any possibility that munlock_vma_pages_all()/unmap_vmas()/free_pgtables() by exit_mmap() are blocked for memory allocation, your patch will introduce an OOM livelock. > I'm not sure how likely your scenario is in the real world, but if it poses a problem then I believe it should be fixed by eventually deferring previous victims as a change to oom_evaluate_task(), not exit_mmap(). If you'd like me to fix that, please send along your test case that triggers it and I will send a patch. > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 4:47 ` Tetsuo Handa @ 2018-04-18 5:20 ` David Rientjes 0 siblings, 0 replies; 47+ messages in thread From: David Rientjes @ 2018-04-18 5:20 UTC (permalink / raw) To: Tetsuo Handa Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Wed, 18 Apr 2018, Tetsuo Handa wrote: > > Commit 97b1255cb27c is referencing MMF_OOM_SKIP already being set by > > exit_mmap(). The only thing this patch changes is where that is done: > > before or after free_pgtables(). We can certainly move it to before > > free_pgtables() at the risk of subsequent (and eventually unnecessary) oom > > kills. It's not exactly the point of this patch. > > > > I have thousands of real-world examples where additional processes were > > oom killed while the original victim was in free_pgtables(). That's why > > we've moved the MMF_OOM_SKIP to after free_pgtables(). > > "we have moved"? No, not yet. Your patch is about to move it. > I'm referring to our own kernel, we have thousands of real-world examples where additional processes have been oom killed where the original victim is in free_pgtables(). It actually happens about 10-15% of the time in automated testing where you create a 128MB memcg, fork a canary, and then fork a >128MB memory hog. 10-15% of the time both processes get oom killed: the memory hog first (higher rss), the canary second. The pgtable stat is unchanged between oom kills. > My question is: is it guaranteed that munlock_vma_pages_all()/unmap_vmas()/free_pgtables() > by exit_mmap() are never blocked for memory allocation. Note that exit_mmap() tries to unmap > all pages while the OOM reaper tries to unmap only safe pages. If there is possibility that > munlock_vma_pages_all()/unmap_vmas()/free_pgtables() by exit_mmap() are blocked for memory > allocation, your patch will introduce an OOM livelock. 
> If munlock_vma_pages_all(), unmap_vmas(), or free_pgtables() require memory to make forward progress, then we have bigger problems :) I just ran a query of real-world oom kill logs that I have. In 33,773,705 oom kills, I have no evidence of a thread failing to exit after reaching exit_mmap(). You may recall from my support of your patch to emit the stack trace when the oom reaper fails, in https://marc.info/?l=linux-mm&m=152157881518627, that I have logs of 28,222,058 occurrences of the oom reaper where it successfully frees memory and the victim exits. If you'd like to pursue the possibility that exit_mmap() blocks before freeing memory that we have somehow been lucky to miss in 33 million occurrences, I'd appreciate the test case. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 2:52 ` [patch v2] " David Rientjes 2018-04-18 3:55 ` Tetsuo Handa @ 2018-04-18 7:50 ` Michal Hocko 2018-04-18 11:49 ` Tetsuo Handa 2018-04-18 19:14 ` David Rientjes 1 sibling, 2 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-18 7:50 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Tue 17-04-18 19:52:41, David Rientjes wrote: > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > possible for the oom reaper to concurrently operate on an mm until > MMF_OOM_SKIP is set. > > This allows munlock_vma_pages_all() to concurrently run while the oom > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > clearing VM_LOCKED from vm_flags before actually doing the munlock to > determine if any other vmas are locking the same memory, the check for > VM_LOCKED in the oom reaper is racy. > > This is especially noticeable on architectures such as powerpc where > clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd > is zapped by the oom reaper during follow_page_mask() after the check for > pmd_none() is bypassed, this ends up deferencing a NULL ptl. > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > reaped. This prevents the concurrent munlock_vma_pages_range() and > unmap_page_range(). The oom reaper will simply not operate on an mm that > has the bit set and leave the unmapping to exit_mmap(). This will further complicate the protocol and actually theoretically restores the oom lockup issues because the oom reaper doesn't set MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing blocks there... So the resulting code is more fragile and tricky. 
Can we try a simpler way and get back to what I was suggesting before [1] and simply not play tricks with down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); and use the write lock in exit_mmap for oom_victims? Andrea wanted to make this more clever but this is the second fallout which could have been prevented. The patch would be smaller and the locking protocol easier [1] http://lkml.kernel.org/r/20170727065023.GB20970@dhcp22.suse.cz > Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") > Cc: stable@vger.kernel.org [4.14+] > Signed-off-by: David Rientjes <rientjes@google.com> > --- > v2: > - oom reaper only sets MMF_OOM_SKIP if MMF_UNSTABLE was never set (either > by itself or by exit_mmap(), per Tetsuo > - s/kick_all_cpus_sync/serialize_against_pte_lookup/ in changelog as more > isolated way of forcing cpus as non-idle on power > > mm/mmap.c | 38 ++++++++++++++++++++------------------ > mm/oom_kill.c | 28 +++++++++++++--------------- > 2 files changed, 33 insertions(+), 33 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm) > /* mm's last user has gone, and its about to be pulled down */ > mmu_notifier_release(mm); > > + if (unlikely(mm_is_oom_victim(mm))) { > + /* > + * Wait for oom_reap_task() to stop working on this mm. Because > + * MMF_UNSTABLE is already set before calling down_read(), > + * oom_reap_task() will not run on this mm after up_write(). > + * oom_reap_task() also depends on a stable VM_LOCKED flag to > + * indicate it should not unmap during munlock_vma_pages_all(). > + * > + * mm_is_oom_victim() cannot be set from under us because > + * victim->mm is already set to NULL under task_lock before > + * calling mmput() and victim->signal->oom_mm is set by the oom > + * killer only if victim->mm is non-NULL while holding > + * task_lock(). 
> + */ > + set_bit(MMF_UNSTABLE, &mm->flags); > + down_write(&mm->mmap_sem); > + up_write(&mm->mmap_sem); > + } > + > if (mm->locked_vm) { > vma = mm->mmap; > while (vma) { > @@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm) > /* update_hiwater_rss(mm) here? but nobody should be looking */ > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > - > - if (unlikely(mm_is_oom_victim(mm))) { > - /* > - * Wait for oom_reap_task() to stop working on this > - * mm. Because MMF_OOM_SKIP is already set before > - * calling down_read(), oom_reap_task() will not run > - * on this "mm" post up_write(). > - * > - * mm_is_oom_victim() cannot be set from under us > - * either because victim->mm is already set to NULL > - * under task_lock before calling mmput and oom_mm is > - * set not NULL by the OOM killer only if victim->mm > - * is found not NULL while holding the task_lock. > - */ > - set_bit(MMF_OOM_SKIP, &mm->flags); > - down_write(&mm->mmap_sem); > - up_write(&mm->mmap_sem); > - } > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > tlb_finish_mmu(&tlb, 0, -1); > + set_bit(MMF_OOM_SKIP, &mm->flags); > > /* > * Walk the list again, actually closing and freeing it, > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > } > > /* > - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't > - * work on the mm anymore. The check for MMF_OOM_SKIP must run > + * Tell all users of get_user/copy_from_user etc... that the content > + * is no longer stable. No barriers really needed because unmapping > + * should imply barriers already and the reader would hit a page fault > + * if it stumbled over reaped memory. > + * > + * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't > + * work on the mm anymore. 
The check for MMF_OOM_UNSTABLE must run > * under mmap_sem for reading because it serializes against the > * down_write();up_write() cycle in exit_mmap(). > */ > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { > + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) { > up_read(&mm->mmap_sem); > trace_skip_task_reaping(tsk->pid); > goto unlock_oom; > @@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > > trace_start_task_reaping(tsk->pid); > > - /* > - * Tell all users of get_user/copy_from_user etc... that the content > - * is no longer stable. No barriers really needed because unmapping > - * should imply barriers already and the reader would hit a page fault > - * if it stumbled over a reaped memory. > - */ > - set_bit(MMF_UNSTABLE, &mm->flags); > - > for (vma = mm->mmap ; vma; vma = vma->vm_next) { > if (!can_madv_dontneed_vma(vma)) > continue; > @@ -567,6 +564,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > tlb_finish_mmu(&tlb, start, end); > } > } > + set_bit(MMF_OOM_SKIP, &mm->flags); > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > task_pid_nr(tsk), tsk->comm, > K(get_mm_counter(mm, MM_ANONPAGES)), > @@ -594,7 +592,6 @@ static void oom_reap_task(struct task_struct *tsk) > test_bit(MMF_OOM_SKIP, &mm->flags)) > goto done; > > - > pr_info("oom_reaper: unable to reap pid:%d (%s)\n", > task_pid_nr(tsk), tsk->comm); > debug_show_all_locks(); > @@ -603,10 +600,11 @@ static void oom_reap_task(struct task_struct *tsk) > tsk->oom_reaper_list = NULL; > > /* > - * Hide this mm from OOM killer because it has been either reaped or > - * somebody can't call up_write(mmap_sem). > + * If the oom reaper could not get started on this mm and it has not yet > + * reached exit_mmap(), set MMF_OOM_SKIP to disregard. 
> */ > - set_bit(MMF_OOM_SKIP, &mm->flags); > + if (!test_bit(MMF_UNSTABLE, &mm->flags)) > + set_bit(MMF_OOM_SKIP, &mm->flags); > > /* Drop a reference taken by wake_oom_reaper */ > put_task_struct(tsk); -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 7:50 ` Michal Hocko @ 2018-04-18 11:49 ` Tetsuo Handa 2018-04-18 11:58 ` Michal Hocko 2018-04-18 19:14 ` David Rientjes 1 sibling, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-18 11:49 UTC (permalink / raw) To: mhocko, rientjes; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm Michal Hocko wrote: > On Tue 17-04-18 19:52:41, David Rientjes wrote: > > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > > possible for the oom reaper to concurrently operate on an mm until > > MMF_OOM_SKIP is set. > > > > This allows munlock_vma_pages_all() to concurrently run while the oom > > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > > clearing VM_LOCKED from vm_flags before actually doing the munlock to > > determine if any other vmas are locking the same memory, the check for > > VM_LOCKED in the oom reaper is racy. > > > > This is especially noticeable on architectures such as powerpc where > > clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd > > is zapped by the oom reaper during follow_page_mask() after the check for > > pmd_none() is bypassed, this ends up deferencing a NULL ptl. > > > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > > reaped. This prevents the concurrent munlock_vma_pages_range() and > > unmap_page_range(). The oom reaper will simply not operate on an mm that > > has the bit set and leave the unmapping to exit_mmap(). > > This will further complicate the protocol and actually theoretically > restores the oom lockup issues because the oom reaper doesn't set > MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing > blocks there... So the resulting code is more fragile and tricky. 
> > Can we try a simpler way and get back to what I was suggesting before > [1] and simply not play tricks with > down_write(&mm->mmap_sem); > up_write(&mm->mmap_sem); > > and use the write lock in exit_mmap for oom_victims? You mean something like this? Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write. It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()... diff --git a/mm/mmap.c b/mm/mmap.c index 188f195..ba7083b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3011,17 +3011,22 @@ void exit_mmap(struct mm_struct *mm) struct mmu_gather tlb; struct vm_area_struct *vma; unsigned long nr_accounted = 0; + const bool is_oom_mm = mm_is_oom_victim(mm); /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); if (mm->locked_vm) { + if (is_oom_mm) + down_write(&mm->mmap_sem); vma = mm->mmap; while (vma) { if (vma->vm_flags & VM_LOCKED) munlock_vma_pages_all(vma); vma = vma->vm_next; } + if (is_oom_mm) + up_write(&mm->mmap_sem); } arch_exit_mmap(mm); @@ -3037,7 +3042,7 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - if (unlikely(mm_is_oom_victim(mm))) { + if (unlikely(is_oom_mm)) { /* * Wait for oom_reap_task() to stop working on this * mm. Because MMF_OOM_SKIP is already set before ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 11:49 ` Tetsuo Handa @ 2018-04-18 11:58 ` Michal Hocko 2018-04-18 13:25 ` Tetsuo Handa 0 siblings, 1 reply; 47+ messages in thread From: Michal Hocko @ 2018-04-18 11:58 UTC (permalink / raw) To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm On Wed 18-04-18 20:49:11, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Tue 17-04-18 19:52:41, David Rientjes wrote: > > > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > > > possible for the oom reaper to concurrently operate on an mm until > > > MMF_OOM_SKIP is set. > > > > > > This allows munlock_vma_pages_all() to concurrently run while the oom > > > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > > > clearing VM_LOCKED from vm_flags before actually doing the munlock to > > > determine if any other vmas are locking the same memory, the check for > > > VM_LOCKED in the oom reaper is racy. > > > > > > This is especially noticeable on architectures such as powerpc where > > > clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd > > > is zapped by the oom reaper during follow_page_mask() after the check for > > > pmd_none() is bypassed, this ends up deferencing a NULL ptl. > > > > > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > > > reaped. This prevents the concurrent munlock_vma_pages_range() and > > > unmap_page_range(). The oom reaper will simply not operate on an mm that > > > has the bit set and leave the unmapping to exit_mmap(). > > > > This will further complicate the protocol and actually theoretically > > restores the oom lockup issues because the oom reaper doesn't set > > MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing > > blocks there... So the resulting code is more fragile and tricky. 
> > > > Can we try a simpler way and get back to what I was suggesting before > > [1] and simply not play tricks with > > down_write(&mm->mmap_sem); > > up_write(&mm->mmap_sem); > > > > and use the write lock in exit_mmap for oom_victims? > > You mean something like this? or simply hold the write lock until we unmap and free page tables. It would make the locking rules much more straightforward. What you are proposing is more focused on this particular fix and it would work as well but the subtle locking would still stay in place. I am not sure we want the trickiness. > Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write. > It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()... I am not sure I understand. > diff --git a/mm/mmap.c b/mm/mmap.c > index 188f195..ba7083b 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3011,17 +3011,22 @@ void exit_mmap(struct mm_struct *mm) > struct mmu_gather tlb; > struct vm_area_struct *vma; > unsigned long nr_accounted = 0; > + const bool is_oom_mm = mm_is_oom_victim(mm); > > /* mm's last user has gone, and its about to be pulled down */ > mmu_notifier_release(mm); > > if (mm->locked_vm) { > + if (is_oom_mm) > + down_write(&mm->mmap_sem); > vma = mm->mmap; > while (vma) { > if (vma->vm_flags & VM_LOCKED) > munlock_vma_pages_all(vma); > vma = vma->vm_next; > } > + if (is_oom_mm) > + up_write(&mm->mmap_sem); > } > > arch_exit_mmap(mm); > @@ -3037,7 +3042,7 @@ void exit_mmap(struct mm_struct *mm) > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > > - if (unlikely(mm_is_oom_victim(mm))) { > + if (unlikely(is_oom_mm)) { > /* > * Wait for oom_reap_task() to stop working on this > * mm. Because MMF_OOM_SKIP is already set before -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 11:58 ` Michal Hocko @ 2018-04-18 13:25 ` Tetsuo Handa 2018-04-18 13:44 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-18 13:25 UTC (permalink / raw) To: mhocko; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm Michal Hocko wrote: > > > Can we try a simpler way and get back to what I was suggesting before > > > [1] and simply not play tricks with > > > down_write(&mm->mmap_sem); > > > up_write(&mm->mmap_sem); > > > > > > and use the write lock in exit_mmap for oom_victims? > > > > You mean something like this? > > or simply hold the write lock until we unmap and free page tables. That increases possibility of __oom_reap_task_mm() giving up reclaim and setting MMF_OOM_SKIP when exit_mmap() is making forward progress, doesn't it? I think that it is better that __oom_reap_task_mm() does not give up when exit_mmap() can make progress. In that aspect, the section protected by mmap_sem held for write should be as short as possible. > It would make the locking rules much more straightforward. > What you are proposing is more focused on this particular fix and it > would work as well but the subtle locking would still stay in place. Yes, this change is focused on -stable patch. > I am not sure we want the trickiness. I don't like the trickiness too. I think we can even consider direct OOM reaping suggested at https://patchwork.kernel.org/patch/10095661/ . > > > Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write. > > It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()... > > I am not sure I understand. To reduce possibility of __oom_reap_task_mm() giving up reclaim and setting MMF_OOM_SKIP. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 13:25 ` Tetsuo Handa @ 2018-04-18 13:44 ` Michal Hocko 2018-04-18 14:28 ` Tetsuo Handa 0 siblings, 1 reply; 47+ messages in thread From: Michal Hocko @ 2018-04-18 13:44 UTC (permalink / raw) To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm On Wed 18-04-18 22:25:54, Tetsuo Handa wrote: > Michal Hocko wrote: > > > > Can we try a simpler way and get back to what I was suggesting before > > > > [1] and simply not play tricks with > > > > down_write(&mm->mmap_sem); > > > > up_write(&mm->mmap_sem); > > > > > > > > and use the write lock in exit_mmap for oom_victims? > > > > > > You mean something like this? > > > > or simply hold the write lock until we unmap and free page tables. > > That increases possibility of __oom_reap_task_mm() giving up reclaim and > setting MMF_OOM_SKIP when exit_mmap() is making forward progress, doesn't it? Yes it does. But it is not that likely and easily noticeable from the logs so we can make the locking protocol more complex if this really hits two often. > I think that it is better that __oom_reap_task_mm() does not give up when > exit_mmap() can make progress. In that aspect, the section protected by > mmap_sem held for write should be as short as possible. Sure, but then weight the complexity on the other side and try to think whether simpler code which works most of the time is better than a buggy complex one. The current protocol has 2 followup fixes which speaks for itself. [...] > > > Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write. > > > It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()... > > > > I am not sure I understand. > > To reduce possibility of __oom_reap_task_mm() giving up reclaim and > setting MMF_OOM_SKIP. Still do not understand. Do you want to call __oom_reap_task_mm from __mmput? If yes why would you do so when exit_mmap does a stronger version of it? 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 13:44 ` Michal Hocko @ 2018-04-18 14:28 ` Tetsuo Handa 0 siblings, 0 replies; 47+ messages in thread From: Tetsuo Handa @ 2018-04-18 14:28 UTC (permalink / raw) To: mhocko; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm Michal Hocko wrote: > > > > Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write. > > > > It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()... > > > > > > I am not sure I understand. > > > > To reduce possibility of __oom_reap_task_mm() giving up reclaim and > > setting MMF_OOM_SKIP. > > Still do not understand. Do you want to call __oom_reap_task_mm from > __mmput? Yes. > If yes why would you do so when exit_mmap does a stronger > version of it? Because memory which can be reclaimed by the OOM reaper is then guaranteed to be reclaimed before MMF_OOM_SKIP is set when the OOM reaper and exit_mmap() contend: the OOM reaper (weak reclaim) sets MMF_OOM_SKIP only after one second, as a safety net in case exit_mmap() (strong reclaim) fails to make forward progress. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 7:50 ` Michal Hocko 2018-04-18 11:49 ` Tetsuo Handa @ 2018-04-18 19:14 ` David Rientjes 2018-04-19 6:35 ` Michal Hocko 1 sibling, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-18 19:14 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Wed, 18 Apr 2018, Michal Hocko wrote: > > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > > possible for the oom reaper to concurrently operate on an mm until > > MMF_OOM_SKIP is set. > > > > This allows munlock_vma_pages_all() to concurrently run while the oom > > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > > clearing VM_LOCKED from vm_flags before actually doing the munlock to > > determine if any other vmas are locking the same memory, the check for > > VM_LOCKED in the oom reaper is racy. > > > > This is especially noticeable on architectures such as powerpc where > > clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd > > is zapped by the oom reaper during follow_page_mask() after the check for > > pmd_none() is bypassed, this ends up deferencing a NULL ptl. > > > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > > reaped. This prevents the concurrent munlock_vma_pages_range() and > > unmap_page_range(). The oom reaper will simply not operate on an mm that > > has the bit set and leave the unmapping to exit_mmap(). > > This will further complicate the protocol and actually theoretically > restores the oom lockup issues because the oom reaper doesn't set > MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing > blocks there... So the resulting code is more fragile and tricky. > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is entered. 
> Can we try a simpler way and get back to what I was suggesting before > [1] and simply not play tricks with > down_write(&mm->mmap_sem); > up_write(&mm->mmap_sem); > > and use the write lock in exit_mmap for oom_victims? > > Andrea wanted to make this more clever but this is the second fallout > which could have been prevented. The patch would be smaller and the > locking protocol easier > > [1] http://lkml.kernel.org/r/20170727065023.GB20970@dhcp22.suse.cz > exit_mmap() doesn't need to protect munlock, unmap, or freeing pgtables with mm->mmap_sem; the issue is that you need to start holding it in this case before munlock and then until at least the end of free_pgtables(). Anything in between also needlessly holds it so could introduce weird lockdep issues that only trigger for oom victims, i.e. they could be very rare on some configs. I don't necessarily like holding a mutex over functions where it's actually not needed, not only as a general principle but also because the oom reaper can now infer that reaping isn't possible just because it can't do down_read() and isn't aware the thread is actually in exit_mmap() needlessly holding it. I like how the oom reaper currently retries on failing to grab mm->mmap_sem and then backs out because it's assumed it can't make forward progress. Adding additional complication for situations where mm->mmap_sem is contended (and munlock to free_pgtables() can take a long time for certain processes) to check if it's actually already in exit_mmap() would seem more complicated than this. The patch is simply using MMF_UNSTABLE rather than MMF_OOM_SKIP to serialize exit_mmap() with the oom reaper and doing it before anything interesting in exit_mmap() because without it the munlock can trivially race with unmap_page_range() and cause a NULL pointer or #GP on a pmd or pte. 
The way Andrea implemented it is fine, we simply have revealed a race between munlock_vma_pages_all() and unmap_page_range() that needs it to do set_bit(); down_write(); up_write(); earlier. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-18 19:14 ` David Rientjes @ 2018-04-19 6:35 ` Michal Hocko 2018-04-19 10:45 ` Tetsuo Handa 2018-04-19 19:34 ` David Rientjes 0 siblings, 2 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-19 6:35 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Wed 18-04-18 12:14:29, David Rientjes wrote: > On Wed, 18 Apr 2018, Michal Hocko wrote: > > > > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > > > possible for the oom reaper to concurrently operate on an mm until > > > MMF_OOM_SKIP is set. > > > > > > This allows munlock_vma_pages_all() to concurrently run while the oom > > > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > > > clearing VM_LOCKED from vm_flags before actually doing the munlock to > > > determine if any other vmas are locking the same memory, the check for > > > VM_LOCKED in the oom reaper is racy. > > > > > > This is especially noticeable on architectures such as powerpc where > > > clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd > > > is zapped by the oom reaper during follow_page_mask() after the check for > > > pmd_none() is bypassed, this ends up deferencing a NULL ptl. > > > > > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > > > reaped. This prevents the concurrent munlock_vma_pages_range() and > > > unmap_page_range(). The oom reaper will simply not operate on an mm that > > > has the bit set and leave the unmapping to exit_mmap(). > > > > This will further complicate the protocol and actually theoretically > > restores the oom lockup issues because the oom reaper doesn't set > > MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing > > blocks there... So the resulting code is more fragile and tricky. 
> > > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is > entered. Not true. munlock_vma_pages_all might take page_lock which can have unpredictable dependences. This is the reason why we are ruling out mlocked VMAs in the first place when reaping the address space. > > Can we try a simpler way and get back to what I was suggesting before > > [1] and simply not play tricks with > > down_write(&mm->mmap_sem); > > up_write(&mm->mmap_sem); > > > > and use the write lock in exit_mmap for oom_victims? > > > > Andrea wanted to make this more clever but this is the second fallout > > which could have been prevented. The patch would be smaller and the > > locking protocol easier > > > > [1] http://lkml.kernel.org/r/20170727065023.GB20970@dhcp22.suse.cz > > > > exit_mmap() doesn't need to protect munlock, unmap, or freeing pgtables > with mm->mmap_sem; the issue is that you need to start holding it in this > case before munlock and then until at least the end of free_pgtables(). > Anything in between also needlessly holds it so could introduce weird > lockdep issues that only trigger for oom victims, i.e. they could be very > rare on some configs. I don't necessarily like holding a mutex over > functions where it's actually not needed, not only as a general principle > but also because the oom reaper can now infer that reaping isn't possible > just because it can't do down_read() and isn't aware the thread is > actually in exit_mmap() needlessly holding it. While you are correct, strictly speaking, because unmap_vmas can race with the oom reaper. With the lock held during the whole operation we can indeed trigger back off in the oom_repaer. It will keep retrying but the tear down can take quite some time. This is a fair argument. On the other hand your lock protocol introduces the MMF_OOM_SKIP problem I've mentioned above and that really worries me. 
The primary objective of the reaper is to guarantee a forward progress without relying on any externalities. We might kill another OOM victim but that is safer than lock up. [...] > The patch is simply using MMF_UNSTABLE rather than MMF_OOM_SKIP to > serialize exit_mmap() with the oom reaper and doing it before anything > interesting in exit_mmap() because without it the munlock can trivially > race with unmap_page_range() and cause a NULL pointer or #GP on a pmd or > pte. The way Andrea implemented it is fine, we simply have revealed a > race between munlock_vma_pages_all() and unmap_page_range() that needs it > to do set_bit(); down_write(); up_write(); earlier. The current protocol has proven to be error prone so I really believe we should back off and turn it into something much simpler and build on top of that if needed. So do you see any _technical_ reasons why not do [1] and have a simpler protocol easily backportable to stable trees? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 6:35 ` Michal Hocko @ 2018-04-19 10:45 ` Tetsuo Handa 2018-04-19 11:04 ` Michal Hocko 2018-04-19 19:14 ` David Rientjes 2018-04-19 19:34 ` David Rientjes 1 sibling, 2 replies; 47+ messages in thread From: Tetsuo Handa @ 2018-04-19 10:45 UTC (permalink / raw) To: mhocko, rientjes; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm Michal Hocko wrote: > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is > > entered. > > Not true. munlock_vma_pages_all might take page_lock which can have > unpredictable dependences. This is the reason why we are ruling out > mlocked VMAs in the first place when reaping the address space. Wow! Then, > While you are correct, strictly speaking, because unmap_vmas can race > with the oom reaper. With the lock held during the whole operation we > can indeed trigger back off in the oom_repaer. It will keep retrying but > the tear down can take quite some time. This is a fair argument. On the > other hand your lock protocol introduces the MMF_OOM_SKIP problem I've > mentioned above and that really worries me. The primary objective of the > reaper is to guarantee a forward progress without relying on any > externalities. We might kill another OOM victim but that is safer than > lock up. current code has a possibility that the OOM reaper is disturbed by unpredictable dependencies, like I worried that I think that there is a possibility that the OOM reaper tries to reclaim mlocked pages as soon as exit_mmap() cleared VM_LOCKED flag by calling munlock_vma_pages_all(). when current approach was proposed. We currently have the MMF_OOM_SKIP problem. We need to teach the OOM reaper stop reaping as soon as entering exit_mmap(). Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *) decreased for last 1 second) ? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 10:45 ` Tetsuo Handa @ 2018-04-19 11:04 ` Michal Hocko 2018-04-19 11:51 ` Tetsuo Handa 2018-04-19 19:14 ` David Rientjes 1 sibling, 1 reply; 47+ messages in thread From: Michal Hocko @ 2018-04-19 11:04 UTC (permalink / raw) To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm On Thu 19-04-18 19:45:46, Tetsuo Handa wrote: > Michal Hocko wrote: > > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is > > > entered. > > > > Not true. munlock_vma_pages_all might take page_lock which can have > > unpredictable dependences. This is the reason why we are ruling out > > mlocked VMAs in the first place when reaping the address space. > > Wow! Then, > > > While you are correct, strictly speaking, because unmap_vmas can race > > with the oom reaper. With the lock held during the whole operation we > > can indeed trigger back off in the oom_repaer. It will keep retrying but > > the tear down can take quite some time. This is a fair argument. On the > > other hand your lock protocol introduces the MMF_OOM_SKIP problem I've > > mentioned above and that really worries me. The primary objective of the > > reaper is to guarantee a forward progress without relying on any > > externalities. We might kill another OOM victim but that is safer than > > lock up. > > current code has a possibility that the OOM reaper is disturbed by > unpredictable dependencies, like I worried that > > I think that there is a possibility that the OOM reaper tries to reclaim > mlocked pages as soon as exit_mmap() cleared VM_LOCKED flag by calling > munlock_vma_pages_all(). > > when current approach was proposed. We currently have the MMF_OOM_SKIP problem. > We need to teach the OOM reaper stop reaping as soon as entering exit_mmap(). > Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *) > decreased for last 1 second) ? 
Can we start simple and build more elaborate heuristics on top _please_? In other words, holding the mmap_sem for write for oom victims in exit_mmap should handle the problem. We can then enhance this to probe for progress or any other clever tricks if we find out that the race happens too often and we kill more than necessary. Let's not repeat the error of trying to be too clever from the beginning as we did previously. This area is just too subtle and obviously error prone. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 11:04 ` Michal Hocko @ 2018-04-19 11:51 ` Tetsuo Handa 2018-04-19 12:48 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-19 11:51 UTC (permalink / raw) To: mhocko; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm Michal Hocko wrote: > > We need to teach the OOM reaper stop reaping as soon as entering exit_mmap(). > > Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *) > > decreased for last 1 second) ? > > Can we start simple and build a more elaborate heuristics on top _please_? > In other words holding the mmap_sem for write for oom victims in > exit_mmap should handle the problem. We can then enhance this to probe > for progress or any other clever tricks if we find out that the race > happens too often and we kill more than necessary. > > Let's not repeat the error of trying to be too clever from the beginning > as we did previously. This are is just too subtle and obviously error > prone. > Something like this? --- mm/mmap.c | 41 +++++++++++++++++++++++------------------ mm/oom_kill.c | 29 +++++++++++------------------ 2 files changed, 34 insertions(+), 36 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 188f195..3edb7da 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3015,6 +3015,28 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); + if (unlikely(mm_is_oom_victim(mm))) { + /* + * Tell oom_reap_task() not to start reaping this mm. + * + * oom_reap_task() depends on a stable VM_LOCKED flag to + * indicate it should not unmap during munlock_vma_pages_all(). + * + * Since MMF_UNSTABLE is set before calling down_write(), + * oom_reap_task() which calls down_read() before testing + * MMF_UNSTABLE will not run on this mm after up_write(). 
+ * + * mm_is_oom_victim() cannot be set from under us because + * victim->mm is already set to NULL under task_lock before + * calling mmput() and victim->signal->oom_mm is set by the oom + * killer only if victim->mm is non-NULL while holding + * task_lock(). + */ + set_bit(MMF_UNSTABLE, &mm->flags); + down_write(&mm->mmap_sem); + up_write(&mm->mmap_sem); + } + if (mm->locked_vm) { vma = mm->mmap; while (vma) { @@ -3036,26 +3058,9 @@ void exit_mmap(struct mm_struct *mm) /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - - if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because victim->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if victim->mm - * is found not NULL while holding the task_lock. - */ - set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); + set_bit(MMF_OOM_SKIP, &mm->flags); /* * Walk the list again, actually closing and freeing it, diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ff992fa..1fef1b6 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -510,25 +510,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) /* * If the mm has invalidate_{start,end}() notifiers that could block, + * or if the mm is in exit_mmap() which has unpredictable dependencies, * sleep to give the oom victim some more time. 
* TODO: we really want to get rid of this ugly hack and make sure that * notifiers cannot block for unbounded amount of time */ - if (mm_has_blockable_invalidate_notifiers(mm)) { - up_read(&mm->mmap_sem); - schedule_timeout_idle(HZ); - goto unlock_oom; - } - - /* - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't - * work on the mm anymore. The check for MMF_OOM_SKIP must run - * under mmap_sem for reading because it serializes against the - * down_write();up_write() cycle in exit_mmap(). - */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { + if (mm_has_blockable_invalidate_notifiers(mm) || + test_bit(MMF_UNSTABLE, &mm->flags)) { up_read(&mm->mmap_sem); trace_skip_task_reaping(tsk->pid); + schedule_timeout_idle(HZ); goto unlock_oom; } @@ -590,11 +581,9 @@ static void oom_reap_task(struct task_struct *tsk) while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) schedule_timeout_idle(HZ/10); - if (attempts <= MAX_OOM_REAP_RETRIES || - test_bit(MMF_OOM_SKIP, &mm->flags)) + if (test_bit(MMF_UNSTABLE, &mm->flags)) goto done; - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); debug_show_all_locks(); @@ -603,8 +592,12 @@ static void oom_reap_task(struct task_struct *tsk) tsk->oom_reaper_list = NULL; /* - * Hide this mm from OOM killer because it has been either reaped or - * somebody can't call up_write(mmap_sem). + * Hide this mm from the OOM killer because: + * the OOM reaper completed reaping + * or + * exit_mmap() told the OOM reaper not to start reaping + * or + * neither exit_mmap() nor the OOM reaper started reaping */ set_bit(MMF_OOM_SKIP, &mm->flags); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 11:51 ` Tetsuo Handa @ 2018-04-19 12:48 ` Michal Hocko 0 siblings, 0 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-19 12:48 UTC (permalink / raw) To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm On Thu 19-04-18 20:51:45, Tetsuo Handa wrote: > Michal Hocko wrote: > > > We need to teach the OOM reaper to stop reaping as soon as entering exit_mmap(). > > > Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *) > > > decreased for last 1 second) ? > > > > Can we start simple and build more elaborate heuristics on top _please_? > > In other words holding the mmap_sem for write for oom victims in > > exit_mmap should handle the problem. We can then enhance this to probe > > for progress or any other clever tricks if we find out that the race > > happens too often and we kill more than necessary. > > > > Let's not repeat the error of trying to be too clever from the beginning > > as we did previously. This area is just too subtle and obviously error > > prone. > > > Something like this? Not really. This is still building on the tricky locking protocol we have, which has proven to be error prone. Can we simply take the mmap_sem for write for oom victims before munlocking and release it after munmapping? I am OK with building on the current protocol if taking the mmap_sem for the whole section has some serious downsides but I haven't heard any yet, to be honest. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 10:45 ` Tetsuo Handa 2018-04-19 11:04 ` Michal Hocko @ 2018-04-19 19:14 ` David Rientjes 1 sibling, 0 replies; 47+ messages in thread From: David Rientjes @ 2018-04-19 19:14 UTC (permalink / raw) To: Tetsuo Handa; +Cc: mhocko, akpm, aarcange, guro, linux-kernel, linux-mm On Thu, 19 Apr 2018, Tetsuo Handa wrote: > current code has a possibility that the OOM reaper is disturbed by > unpredictable dependencies, like I worried that > > I think that there is a possibility that the OOM reaper tries to reclaim > mlocked pages as soon as exit_mmap() cleared VM_LOCKED flag by calling > munlock_vma_pages_all(). > > when current approach was proposed. That's exactly the issue that this patch is fixing, yes. If you brought that possibility up then I'm sorry that it was ignored. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 6:35 ` Michal Hocko 2018-04-19 10:45 ` Tetsuo Handa @ 2018-04-19 19:34 ` David Rientjes 2018-04-19 22:13 ` Tetsuo Handa 2018-04-20 8:23 ` Michal Hocko 1 sibling, 2 replies; 47+ messages in thread From: David Rientjes @ 2018-04-19 19:34 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Thu, 19 Apr 2018, Michal Hocko wrote: > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is > > entered. > > Not true. munlock_vma_pages_all might take page_lock which can have > unpredictable dependences. This is the reason why we are ruling out > mlocked VMAs in the first place when reaping the address space. > I don't find any occurrences in millions of oom kills in real-world scenarios where this matters. The solution is certainly not to hold down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead. If exit_mmap() is not making forward progress then that's a separate issue; that would need to be fixed in one of two ways: (1) in oom_reap_task() to try over a longer duration before setting MMF_OOM_SKIP itself, but that would have to be a long duration to allow a large unmap and page table free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but only if MMF_UNSTABLE has been set for a long period of time so we target another process when the oom killer has given up. Either of those two fixes are simple to implement, I'd just like to see a bug report with stack traces to indicate that a victim getting stalled in exit_mmap() is a problem to justify the patch. I'm trying to fix the page table corruption that is trivial to trigger on powerpc. We simply cannot allow the oom reaper's unmap_page_range() to race with munlock_vma_pages_range(), ever. 
Holding down_write on mm->mmap_sem otherwise needlessly over a large amount of code is riskier (hasn't been done or tested here), more error prone (any code change over this large area of code or in functions it calls are unnecessarily burdened by unnecessary locking), makes exit_mmap() less extensible for the same reason, and causes the oom reaper to give up and go set MMF_OOM_SKIP itself because it depends on taking down_read while the thread is still exiting. > On the > other hand your lock protocol introduces the MMF_OOM_SKIP problem I've > mentioned above and that really worries me. The primary objective of the > reaper is to guarantee a forward progress without relying on any > externalities. We might kill another OOM victim but that is safer than > lock up. > I understand the concern, but it's the difference between the victim getting stuck in exit_mmap() and actually taking a long time to free its memory in exit_mmap(). I don't have evidence of the former. If there are bug reports for occurrences of the oom reaper being unable to reap, it would be helpful to see them. The only reports about the "unable to reap" message were that the message itself was racy, not that a thread got stuck. This is more reason not to take down_write unnecessarily in the exit_mmap() path, because it influences an oom reaper heuristic. > The current protocol has proven to be error prone so I really believe we > should back off and turn it into something much simpler and build on top > of that if needed. > > So do you see any _technical_ reasons why not do [1] and have a simpler protocol easily backportable to stable trees? It's not simpler per the above, and I agree with Andrea's assessment when this was originally implemented. The current method is not error prone; it works, it just wasn't protecting enough of exit_mmap(). That's not a criticism of the method itself; it's a bugfix that expands its critical section.
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 19:34 ` David Rientjes @ 2018-04-19 22:13 ` Tetsuo Handa 2018-04-20 8:23 ` Michal Hocko 1 sibling, 0 replies; 47+ messages in thread From: Tetsuo Handa @ 2018-04-19 22:13 UTC (permalink / raw) To: rientjes, mhocko; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm David Rientjes wrote: > On Thu, 19 Apr 2018, Michal Hocko wrote: > > > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is > > > entered. > > > > Not true. munlock_vma_pages_all might take page_lock which can have > > unpredictable dependences. This is the reason why we are ruling out > > mlocked VMAs in the first place when reaping the address space. > > > > I don't find any occurrences in millions of oom kills in real-world > scenarios where this matters. Are your OOM events system-wide rather than memcg? It is trivial to hide bugs in the details if your OOM events are memcg OOM. > The solution is certainly not to hold > down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead. If > exit_mmap() is not making forward progress then that's a separate issue; Just simple memory + CPU pressure is sufficient to make exit_mmap() unable to make forward progress. Try triggering a system-wide OOM event by running the reproducer below. We have been ignoring this issue.

-----
#include <unistd.h>

int main(int argc, char *argv[])
{
	while (1)
		if (fork() == 0)
			execlp(argv[0], argv[0], NULL);
	return 0;
}
-----

> that would need to be fixed in one of two ways: (1) in oom_reap_task() to > try over a longer duration before setting MMF_OOM_SKIP itself, but that > would have to be a long duration to allow a large unmap and page table > free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but > only if MMF_UNSTABLE has been set for a long period of time so we target > another process when the oom killer has given up.
> > Either of those two fixes are simple to implement, I'd just like to see a > bug report with stack traces to indicate that a victim getting stalled in > exit_mmap() is a problem to justify the patch. It is too hard for normal users to report problems under memory pressure without a means to help understand what is happening. See the bug report at https://lists.opensuse.org/opensuse-kernel/2018-04/msg00018.html for example. > > I'm trying to fix the page table corruption that is trivial to trigger on > powerpc. We simply cannot allow the oom reaper's unmap_page_range() to > race with munlock_vma_pages_range(), ever. Holding down_write on > mm->mmap_sem otherwise needlessly over a large amount of code is riskier > (hasn't been done or tested here), more error prone (any code change over > this large area of code or in functions it calls are unnecessarily > burdened by unnecessary locking), makes exit_mmap() less extensible for > the same reason, and causes the oom reaper to give up and go set > MMF_OOM_SKIP itself because it depends on taking down_read while the > thread is still exiting. I suggest reverting 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently"). We can check for progress for a while before setting MMF_OOM_SKIP after the OOM reaper has completed or given up reaping.
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-19 19:34 ` David Rientjes 2018-04-19 22:13 ` Tetsuo Handa @ 2018-04-20 8:23 ` Michal Hocko 2018-04-20 12:40 ` Michal Hocko 2018-04-22 3:45 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes 1 sibling, 2 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-20 8:23 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Thu 19-04-18 12:34:53, David Rientjes wrote: > On Thu, 19 Apr 2018, Michal Hocko wrote: > > > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is > > > entered. > > > > Not true. munlock_vma_pages_all might take page_lock which can have > > unpredictable dependences. This is the reason why we are ruling out > > mlocked VMAs in the first place when reaping the address space. > > > > I don't find any occurrences in millions of oom kills in real-world > scenarios where this matters. Which doesn't really mean much. We want a guarantee here. > The solution is certainly not to hold > down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead. Why not? This is what we do for normal paths. The exit path just tries to be clever because it knows that it doesn't have to lock because there is no concurrent user. At least it wasn't until the oom reaper came. So I really fail to see why we shouldn't do the most obvious thing and use the locking. > If > exit_mmap() is not making forward progress then that's a separate issue; Please read what I wrote. There is a page lock and there is no way to guarantee it will make forward progress. Or do you consider that not true?
> that would need to be fixed in one of two ways: (1) in oom_reap_task() to > try over a longer duration before setting MMF_OOM_SKIP itself, but that > would have to be a long duration to allow a large unmap and page table > free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but > only if MMF_UNSTABLE has been set for a long period of time so we target > another process when the oom killer has given up. > > Either of those two fixes are simple to implement, I'd just like to see a > bug report with stack traces to indicate that a victim getting stalled in > exit_mmap() is a problem to justify the patch. And both are not really needed if we do the proper and straightforward locking. > I'm trying to fix the page table corruption that is trivial to trigger on > powerpc. We simply cannot allow the oom reaper's unmap_page_range() to > race with munlock_vma_pages_range(), ever. There is no discussion about that. Sure, you are right. We are just arguing how to achieve that. > Holding down_write on > mm->mmap_sem otherwise needlessly over a large amount of code is riskier > (hasn't been done or tested here), more error prone (any code change over > this large area of code or in functions it calls are unnecessarily > burdened by unnecessary locking), makes exit_mmap() less extensible for > the same reason, I do not see any of the calls in that path could suffer from holding mmap_sem. Do you? > and causes the oom reaper to give up and go set > MMF_OOM_SKIP itself because it depends on taking down_read while the > thread is still exiting. Which is the standard backoff mechanism. > > On the > > other hand your lock protocol introduces the MMF_OOM_SKIP problem I've > > mentioned above and that really worries me. The primary objective of the > > reaper is to guarantee a forward progress without relying on any > > externalities. We might kill another OOM victim but that is safer than > > lock up. 
> > > > I understand the concern, but it's the difference between the victim > getting stuck in exit_mmap() and actually taking a long time to free its > memory in exit_mmap(). I don't have evidence of the former. I do not really want to repeat myself. The primary purpose of the oom reaper is to provide a _guarantee_ of the forward progress. I do not care whether there is any evidences. All I know that lock_page has plethora of different dependencies and we cannot clearly state this is safe so we _must not_ depend on it when setting MMF_OOM_SKIP. The way how the oom path was fragile and lockup prone based on optimistic assumptions shouldn't be repeated. That being said, I haven't heard any actual technical argument about why locking the munmap path is a wrong thing to do while the MMF_OOM_SKIP dependency on the page_lock really concerns me so Nacked-by: Michal Hocko <mhocko@suse.com> If you want to keep the current locking protocol then you really have to make sure that the oom reaper will set MMF_OOM_SKIP when racing with exit_mmap. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-20 8:23 ` Michal Hocko @ 2018-04-20 12:40 ` Michal Hocko 2018-04-22 3:45 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes 1 sibling, 0 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-20 12:40 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Fri 20-04-18 10:23:49, Michal Hocko wrote: > On Thu 19-04-18 12:34:53, David Rientjes wrote: [...] > > I understand the concern, but it's the difference between the victim > > getting stuck in exit_mmap() and actually taking a long time to free its > > memory in exit_mmap(). I don't have evidence of the former. > > I do not really want to repeat myself. The primary purpose of the oom > reaper is to provide a _guarantee_ of the forward progress. I do not > care whether there is any evidences. All I know that lock_page has > plethora of different dependencies and we cannot clearly state this is > safe so we _must not_ depend on it when setting MMF_OOM_SKIP. > > The way how the oom path was fragile and lockup prone based on > optimistic assumptions shouldn't be repeated. > > That being said, I haven't heard any actual technical argument about why > locking the munmap path is a wrong thing to do while the MMF_OOM_SKIP > dependency on the page_lock really concerns me so > > Nacked-by: Michal Hocko <mhocko@suse.com> > > If you want to keep the current locking protocol then you really have to > make sure that the oom reaper will set MMF_OOM_SKIP when racing with > exit_mmap. So here is my suggestion for the fix. 
>From 37ab85d619f508ceaf829e57648a3d986c6d8bfc Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Fri, 20 Apr 2018 13:53:08 +0200 Subject: [PATCH] oom: mm, oom: fix concurrent munlock and oom reaper unmap David has noticed that our current assumption that the oom reaper can race with exit_mmap is not correct. munlock_vma_pages_all() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl. Fix this by taking the exclusive mmap_sem in exit_mmap while tearing down the address space. Once that is done MMF_OOM_SKIP is set so that __oom_reap_task_mm can back off if it manages to take the read lock finally. Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Cc: stable Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/mmap.c | 36 ++++++++++++++++++------------------ mm/oom_kill.c | 2 +- 2 files changed, 19 insertions(+), 19 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index faf85699f1a1..216efa6d9f61 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3004,10 +3004,21 @@ void exit_mmap(struct mm_struct *mm) struct mmu_gather tlb; struct vm_area_struct *vma; unsigned long nr_accounted = 0; + bool locked = false; /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); + /* + * The mm is not accessible for anybody except for the oom reaper + * which cannot race with munlocking so make sure we exclude the + * two. 
+ */ + if (unlikely(mm_is_oom_victim(mm))) { + down_write(&mm->mmap_sem); + locked = true; + } + if (mm->locked_vm) { vma = mm->mmap; while (vma) { @@ -3021,7 +3032,7 @@ void exit_mmap(struct mm_struct *mm) vma = mm->mmap; if (!vma) /* Can happen if dup_mmap() received an OOM */ - return; + goto out_unlock; lru_add_drain(); flush_cache_mm(mm); @@ -3030,23 +3041,6 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because victim->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if victim->mm - * is found not NULL while holding the task_lock. - */ - set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); @@ -3060,6 +3054,12 @@ void exit_mmap(struct mm_struct *mm) vma = remove_vma(vma); } vm_unacct_memory(nr_accounted); + +out_unlock: + if (unlikely(locked)) { + set_bit(MMF_OOM_SKIP, &mm->flags); + up_write(&mm->mmap_sem); + } } /* Insert vm structure into process list sorted by address diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dfd370526909..94d26ef2c3c7 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -524,7 +524,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't * work on the mm anymore. The check for MMF_OOM_SKIP must run * under mmap_sem for reading because it serializes against the - * down_write();up_write() cycle in exit_mmap(). + * exit_mmap(). 
*/ if (test_bit(MMF_OOM_SKIP, &mm->flags)) { up_read(&mm->mmap_sem); -- 2.16.3 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-20 12:40 ` Michal Hocko (?) @ 2018-04-22 3:22 ` David Rientjes 2018-04-22 3:48 ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa -1 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-22 3:22 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Fri, 20 Apr 2018, Michal Hocko wrote: > diff --git a/mm/mmap.c b/mm/mmap.c > index faf85699f1a1..216efa6d9f61 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3004,10 +3004,21 @@ void exit_mmap(struct mm_struct *mm) > struct mmu_gather tlb; > struct vm_area_struct *vma; > unsigned long nr_accounted = 0; > + bool locked = false; > > /* mm's last user has gone, and its about to be pulled down */ > mmu_notifier_release(mm); > > + /* > + * The mm is not accessible for anybody except for the oom reaper > + * which cannot race with munlocking so make sure we exclude the > + * two. > + */ > + if (unlikely(mm_is_oom_victim(mm))) { > + down_write(&mm->mmap_sem); > + locked = true; > + } > + > if (mm->locked_vm) { > vma = mm->mmap; > while (vma) { > @@ -3021,7 +3032,7 @@ void exit_mmap(struct mm_struct *mm) > > vma = mm->mmap; > if (!vma) /* Can happen if dup_mmap() received an OOM */ > - return; > + goto out_unlock; > > lru_add_drain(); > flush_cache_mm(mm); > @@ -3030,23 +3041,6 @@ void exit_mmap(struct mm_struct *mm) > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > > - if (unlikely(mm_is_oom_victim(mm))) { > - /* > - * Wait for oom_reap_task() to stop working on this > - * mm. Because MMF_OOM_SKIP is already set before > - * calling down_read(), oom_reap_task() will not run > - * on this "mm" post up_write(). 
> - * > - * mm_is_oom_victim() cannot be set from under us > - * either because victim->mm is already set to NULL > - * under task_lock before calling mmput and oom_mm is > - * set not NULL by the OOM killer only if victim->mm > - * is found not NULL while holding the task_lock. > - */ > - set_bit(MMF_OOM_SKIP, &mm->flags); > - down_write(&mm->mmap_sem); > - up_write(&mm->mmap_sem); > - } > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > tlb_finish_mmu(&tlb, 0, -1); > > @@ -3060,6 +3054,12 @@ void exit_mmap(struct mm_struct *mm) > vma = remove_vma(vma); > } > vm_unacct_memory(nr_accounted); > + > +out_unlock: > + if (unlikely(locked)) { > + set_bit(MMF_OOM_SKIP, &mm->flags); > + up_write(&mm->mmap_sem); > + } > } > > /* Insert vm structure into process list sorted by address How have you tested this? I'm wondering why you do not see oom killing of many processes if the victim is a very large process that takes a long time to free memory in exit_mmap() as I do because the oom reaper gives up trying to acquire mm->mmap_sem and just sets MMF_OOM_SKIP itself. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap 2018-04-22 3:22 ` David Rientjes @ 2018-04-22 3:48 ` Tetsuo Handa 2018-04-22 13:08 ` Michal Hocko 2018-04-24 2:31 ` David Rientjes 0 siblings, 2 replies; 47+ messages in thread From: Tetsuo Handa @ 2018-04-22 3:48 UTC (permalink / raw) To: rientjes, mhocko; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm David Rientjes wrote: > How have you tested this? > > I'm wondering why you do not see oom killing of many processes if the > victim is a very large process that takes a long time to free memory in > exit_mmap() as I do because the oom reaper gives up trying to acquire > mm->mmap_sem and just sets MMF_OOM_SKIP itself. > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before exit_mmap() holds mmap_sem for write. Then, at least memory which could have been reclaimed if exit_mmap() did not hold mmap_sem for write will be guaranteed to be reclaimed before MMF_OOM_SKIP is set. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap 2018-04-22 3:48 ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa @ 2018-04-22 13:08 ` Michal Hocko 2018-04-24 2:31 ` David Rientjes 1 sibling, 0 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-22 13:08 UTC (permalink / raw) To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm On Sun 22-04-18 12:48:13, Tetsuo Handa wrote: > David Rientjes wrote: > > How have you tested this? > > > > I'm wondering why you do not see oom killing of many processes if the > > victim is a very large process that takes a long time to free memory in > > exit_mmap() as I do because the oom reaper gives up trying to acquire > > mm->mmap_sem and just sets MMF_OOM_SKIP itself. > > > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before > exit_mmap() holds mmap_sem for write. Then, at least memory which could > have been reclaimed if exit_mmap() did not hold mmap_sem for write will > be guaranteed to be reclaimed before MMF_OOM_SKIP is set. That might be an alternative way but I am really wondering whether this is the simplest way forward. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap 2018-04-22 3:48 ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa 2018-04-22 13:08 ` Michal Hocko @ 2018-04-24 2:31 ` David Rientjes 2018-04-24 5:11 ` Tetsuo Handa 2018-04-24 13:04 ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Michal Hocko 1 sibling, 2 replies; 47+ messages in thread From: David Rientjes @ 2018-04-24 2:31 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, Andrew Morton, Andrea Arcangeli, guro, linux-kernel, linux-mm On Sun, 22 Apr 2018, Tetsuo Handa wrote: > > I'm wondering why you do not see oom killing of many processes if the > > victim is a very large process that takes a long time to free memory in > > exit_mmap() as I do because the oom reaper gives up trying to acquire > > mm->mmap_sem and just sets MMF_OOM_SKIP itself. > > > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before > exit_mmap() holds mmap_sem for write. Then, at least memory which could > have been reclaimed if exit_mmap() did not hold mmap_sem for write will > be guaranteed to be reclaimed before MMF_OOM_SKIP is set. > I think that's an exceptionally good idea and will mitigate the concerns of others. It can be done without holding mm->mmap_sem in exit_mmap() and uses the same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we don't get dozens of unnecessary oom kills. What do you think about this? It passes preliminary testing on powerpc and I'm enqueued it for much more intensive testing. (I'm wishing there was a better way to acknowledge your contribution to fixing this issue, especially since you brought up the exact problem this is addressing in previous emails.) mm, oom: fix concurrent munlock and oom reaper unmap Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. 
This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up dereferencing a NULL ptl. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper cannot run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Cc: stable@vger.kernel.org [4.14+] Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/oom.h | 2 ++ mm/mmap.c | 39 ++++++++++++---------- mm/oom_kill.c | 81 ++++++++++++++++++++++++--------------------- 3 files changed, 66 insertions(+), 56 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm) return 0; } +void __oom_reap_task_mm(struct mm_struct *mm); + extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3015,6 +3015,27 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); + if
(unlikely(mm_is_oom_victim(mm))) { + /* + * Manually reap the mm to free as much memory as possible. + * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this + * mm from further consideration. Taking mm->mmap_sem for write + * after setting MMF_OOM_SKIP will guarantee that the oom reaper + * will not run on this mm again after mmap_sem is dropped. + * + * This needs to be done before calling munlock_vma_pages_all(), + * which clears VM_LOCKED, otherwise the oom reaper cannot + * reliably test it. + */ + mutex_lock(&oom_lock); + __oom_reap_task_mm(mm); + mutex_unlock(&oom_lock); + + set_bit(MMF_OOM_SKIP, &mm->flags); + down_write(&mm->mmap_sem); + up_write(&mm->mmap_sem); + } + if (mm->locked_vm) { vma = mm->mmap; while (vma) { @@ -3036,24 +3057,6 @@ void exit_mmap(struct mm_struct *mm) /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - - if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because victim->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if victim->mm - * is found not NULL while holding the task_lock. 
- */ - set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) return false; } - #ifdef CONFIG_MMU /* * OOM Reaper kernel thread which tries to reap the memory used by the OOM @@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); static struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); -static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) +void __oom_reap_task_mm(struct mm_struct *mm) { - struct mmu_gather tlb; struct vm_area_struct *vma; + + /* + * Tell all users of get_user/copy_from_user etc... that the content + * is no longer stable. No barriers really needed because unmapping + * should imply barriers already and the reader would hit a page fault + * if it stumbled over a reaped memory. + */ + set_bit(MMF_UNSTABLE, &mm->flags); + + for (vma = mm->mmap ; vma; vma = vma->vm_next) { + if (!can_madv_dontneed_vma(vma)) + continue; + + /* + * Only anonymous pages have a good chance to be dropped + * without additional steps which we cannot afford as we + * are OOM already. + * + * We do not even care about fs backed pages because all + * which are reclaimable have already been reclaimed and + * we do not want to block exit_mmap by keeping mm ref + * count elevated without a good reason. 
+ */ + if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { + const unsigned long start = vma->vm_start; + const unsigned long end = vma->vm_end; + struct mmu_gather tlb; + + tlb_gather_mmu(&tlb, mm, start, end); + mmu_notifier_invalidate_range_start(mm, start, end); + unmap_page_range(&tlb, vma, start, end, NULL); + mmu_notifier_invalidate_range_end(mm, start, end); + tlb_finish_mmu(&tlb, start, end); + } + } +} + +static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) +{ bool ret = true; /* * We have to make sure to not race with the victim exit path * and cause premature new oom victim selection: - * __oom_reap_task_mm exit_mm + * oom_reap_task_mm exit_mm * mmget_not_zero * mmput * atomic_dec_and_test @@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) trace_start_task_reaping(tsk->pid); - /* - * Tell all users of get_user/copy_from_user etc... that the content - * is no longer stable. No barriers really needed because unmapping - * should imply barriers already and the reader would hit a page fault - * if it stumbled over a reaped memory. - */ - set_bit(MMF_UNSTABLE, &mm->flags); - - for (vma = mm->mmap ; vma; vma = vma->vm_next) { - if (!can_madv_dontneed_vma(vma)) - continue; + __oom_reap_task_mm(mm); - /* - * Only anonymous pages have a good chance to be dropped - * without additional steps which we cannot afford as we - * are OOM already. - * - * We do not even care about fs backed pages because all - * which are reclaimable have already been reclaimed and - * we do not want to block exit_mmap by keeping mm ref - * count elevated without a good reason. 
- */ - if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { - const unsigned long start = vma->vm_start; - const unsigned long end = vma->vm_end; - - tlb_gather_mmu(&tlb, mm, start, end); - mmu_notifier_invalidate_range_start(mm, start, end); - unmap_page_range(&tlb, vma, start, end, NULL); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); - } - } pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm, K(get_mm_counter(mm, MM_ANONPAGES)), @@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk) struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) + while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm)) schedule_timeout_idle(HZ/10); if (attempts <= MAX_OOM_REAP_RETRIES || test_bit(MMF_OOM_SKIP, &mm->flags)) goto done; - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); debug_show_all_locks(); ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 2:31 ` David Rientjes @ 2018-04-24 5:11 ` Tetsuo Handa 2018-04-24 5:35 ` David Rientjes 2018-04-24 13:04 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Michal Hocko 1 sibling, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-24 5:11 UTC (permalink / raw) To: David Rientjes Cc: mhocko, Andrew Morton, Andrea Arcangeli, guro, linux-kernel, linux-mm > On Sun, 22 Apr 2018, Tetsuo Handa wrote: > > > > I'm wondering why you do not see oom killing of many processes if the > > > victim is a very large process that takes a long time to free memory in > > > exit_mmap() as I do because the oom reaper gives up trying to acquire > > > mm->mmap_sem and just sets MMF_OOM_SKIP itself. > > > > > > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before > > exit_mmap() holds mmap_sem for write. Then, at least memory which could > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set. > > > > I think that's an exceptionally good idea and will mitigate the concerns > of others. > > It can be done without holding mm->mmap_sem in exit_mmap() and uses the > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we > don't get dozens of unnecessary oom kills. > > What do you think about this? It passes preliminary testing on powerpc > and I'm enqueued it for much more intensive testing. (I'm wishing there > was a better way to acknowledge your contribution to fixing this issue, > especially since you brought up the exact problem this is addressing in > previous emails.) > I don't think this patch is safe, for exit_mmap() is calling mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap(). 
exit_mmap() must not block while holding oom_lock in order to guarantee that oom_reap_task_mm() can give up. Some suggestion on top of your patch: mm/mmap.c | 13 +++++-------- mm/oom_kill.c | 51 ++++++++++++++++++++++++++------------------------- 2 files changed, 31 insertions(+), 33 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 981eed4..7b31357 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm) /* * Manually reap the mm to free as much memory as possible. * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this - * mm from further consideration. Taking mm->mmap_sem for write - * after setting MMF_OOM_SKIP will guarantee that the oom reaper - * will not run on this mm again after mmap_sem is dropped. + * mm from further consideration. Setting MMF_OOM_SKIP under + * oom_lock held will guarantee that the OOM reaper will not + * run on this mm again. * * This needs to be done before calling munlock_vma_pages_all(), * which clears VM_LOCKED, otherwise the oom reaper cannot * reliably test it. */ - mutex_lock(&oom_lock); __oom_reap_task_mm(mm); - mutex_unlock(&oom_lock); - + mutex_lock(&oom_lock); set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); + mutex_unlock(&oom_lock); } if (mm->locked_vm) { diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8ba6cb8..9a29df8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { bool ret = true; + mutex_lock(&oom_lock); + /* - * We have to make sure to not race with the victim exit path - * and cause premature new oom victim selection: - * oom_reap_task_mm exit_mm - * mmget_not_zero - * mmput - * atomic_dec_and_test - * exit_oom_victim - * [...] 
- * out_of_memory - * select_bad_process - * # no TIF_MEMDIE task selects new victim - * unmap_page_range # frees some memory + * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't + * work on the mm anymore. The check for MMF_OOM_SKIP must run + * under oom_lock held. */ - mutex_lock(&oom_lock); + if (test_bit(MMF_OOM_SKIP, &mm->flags)) + goto unlock_oom; if (!down_read_trylock(&mm->mmap_sem)) { ret = false; @@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) goto unlock_oom; } - /* - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't - * work on the mm anymore. The check for MMF_OOM_SKIP must run - * under mmap_sem for reading because it serializes against the - * down_write();up_write() cycle in exit_mmap(). - */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { - up_read(&mm->mmap_sem); - trace_skip_task_reaping(tsk->pid); - goto unlock_oom; - } - trace_start_task_reaping(tsk->pid); __oom_reap_task_mm(mm); @@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk) /* * Hide this mm from OOM killer because it has been either reaped or * somebody can't call up_write(mmap_sem). + * + * We have to make sure to not cause premature new oom victim selection: + * + * __alloc_pages_may_oom() oom_reap_task_mm()/exit_mmap() + * mutex_trylock(&oom_lock) + * get_page_from_freelist(ALLOC_WMARK_HIGH) # fails + * unmap_page_range() # frees some memory + * set_bit(MMF_OOM_SKIP) + * out_of_memory() + * select_bad_process() + * test_bit(MMF_OOM_SKIP) # selects new oom victim + * mutex_unlock(&oom_lock) + * + * Setting MMF_OOM_SKIP under oom_lock held will guarantee that the + * last second alocation attempt is done by __alloc_pages_may_oom() + * before out_of_memory() selects next OOM victim by finding + * MMF_OOM_SKIP. 
*/ + mutex_lock(&oom_lock); set_bit(MMF_OOM_SKIP, &mm->flags); + mutex_unlock(&oom_lock); /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 5:11 ` Tetsuo Handa @ 2018-04-24 5:35 ` David Rientjes 2018-04-24 21:57 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Tetsuo Handa 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-24 5:35 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, Andrew Morton, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue, 24 Apr 2018, Tetsuo Handa wrote: > > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before > > > exit_mmap() holds mmap_sem for write. Then, at least memory which could > > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will > > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set. > > > > > > > I think that's an exceptionally good idea and will mitigate the concerns > > of others. > > > > It can be done without holding mm->mmap_sem in exit_mmap() and uses the > > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we > > don't get dozens of unnecessary oom kills. > > > > What do you think about this? It passes preliminary testing on powerpc > > and I'm enqueued it for much more intensive testing. (I'm wishing there > > was a better way to acknowledge your contribution to fixing this issue, > > especially since you brought up the exact problem this is addressing in > > previous emails.) > > > > I don't think this patch is safe, for exit_mmap() is calling > mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock > held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap(). One of the reasons that I extracted __oom_reap_task_mm() out of the new oom_reap_task_mm() is to avoid the checks that would be unnecessary when called from exit_mmap(). In this case, we can ignore the mm_has_blockable_invalidate_notifiers() check because exit_mmap() has already done mmu_notifier_release(). 
So I don't think there's a concern about __oom_reap_task_mm() blocking while holding oom_lock. Unless you are referring to something else? > exit_mmap() must not block while holding oom_lock in order to guarantee that > oom_reap_task_mm() can give up. > > Some suggestion on top of your patch: > > mm/mmap.c | 13 +++++-------- > mm/oom_kill.c | 51 ++++++++++++++++++++++++++------------------------- > 2 files changed, 31 insertions(+), 33 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 981eed4..7b31357 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm) > /* > * Manually reap the mm to free as much memory as possible. > * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this > - * mm from further consideration. Taking mm->mmap_sem for write > - * after setting MMF_OOM_SKIP will guarantee that the oom reaper > - * will not run on this mm again after mmap_sem is dropped. > + * mm from further consideration. Setting MMF_OOM_SKIP under > + * oom_lock held will guarantee that the OOM reaper will not > + * run on this mm again. > * > * This needs to be done before calling munlock_vma_pages_all(), > * which clears VM_LOCKED, otherwise the oom reaper cannot > * reliably test it. 
> */ > - mutex_lock(&oom_lock); > __oom_reap_task_mm(mm); > - mutex_unlock(&oom_lock); > - > + mutex_lock(&oom_lock); > set_bit(MMF_OOM_SKIP, &mm->flags); > - down_write(&mm->mmap_sem); > - up_write(&mm->mmap_sem); > + mutex_unlock(&oom_lock); > } > > if (mm->locked_vm) { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 8ba6cb8..9a29df8 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > { > bool ret = true; > > + mutex_lock(&oom_lock); > + > /* > - * We have to make sure to not race with the victim exit path > - * and cause premature new oom victim selection: > - * oom_reap_task_mm exit_mm > - * mmget_not_zero > - * mmput > - * atomic_dec_and_test > - * exit_oom_victim > - * [...] > - * out_of_memory > - * select_bad_process > - * # no TIF_MEMDIE task selects new victim > - * unmap_page_range # frees some memory > + * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't > + * work on the mm anymore. The check for MMF_OOM_SKIP must run > + * under oom_lock held. > */ > - mutex_lock(&oom_lock); > + if (test_bit(MMF_OOM_SKIP, &mm->flags)) > + goto unlock_oom; > > if (!down_read_trylock(&mm->mmap_sem)) { > ret = false; > @@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > goto unlock_oom; > } > > - /* > - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't > - * work on the mm anymore. The check for MMF_OOM_SKIP must run > - * under mmap_sem for reading because it serializes against the > - * down_write();up_write() cycle in exit_mmap(). 
> - */ > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { > - up_read(&mm->mmap_sem); > - trace_skip_task_reaping(tsk->pid); > - goto unlock_oom; > - } > - > trace_start_task_reaping(tsk->pid); > > __oom_reap_task_mm(mm); > @@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk) > /* > * Hide this mm from OOM killer because it has been either reaped or > * somebody can't call up_write(mmap_sem). > + * > + * We have to make sure to not cause premature new oom victim selection: > + * > + * __alloc_pages_may_oom() oom_reap_task_mm()/exit_mmap() > + * mutex_trylock(&oom_lock) > + * get_page_from_freelist(ALLOC_WMARK_HIGH) # fails > + * unmap_page_range() # frees some memory > + * set_bit(MMF_OOM_SKIP) > + * out_of_memory() > + * select_bad_process() > + * test_bit(MMF_OOM_SKIP) # selects new oom victim > + * mutex_unlock(&oom_lock) > + * > + * Setting MMF_OOM_SKIP under oom_lock held will guarantee that the > + * last second alocation attempt is done by __alloc_pages_may_oom() > + * before out_of_memory() selects next OOM victim by finding > + * MMF_OOM_SKIP. > */ > + mutex_lock(&oom_lock); > set_bit(MMF_OOM_SKIP, &mm->flags); > + mutex_unlock(&oom_lock); > > /* Drop a reference taken by wake_oom_reaper */ > put_task_struct(tsk); ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 5:35 ` David Rientjes @ 2018-04-24 21:57 ` Tetsuo Handa 2018-04-24 22:25 ` David Rientjes 0 siblings, 1 reply; 47+ messages in thread From: Tetsuo Handa @ 2018-04-24 21:57 UTC (permalink / raw) To: rientjes; +Cc: mhocko, akpm, aarcange, guro, linux-kernel, linux-mm David Rientjes wrote: > On Tue, 24 Apr 2018, Tetsuo Handa wrote: > > > > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before > > > > exit_mmap() holds mmap_sem for write. Then, at least memory which could > > > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will > > > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set. > > > > > > > > > > I think that's an exceptionally good idea and will mitigate the concerns > > > of others. > > > > > > It can be done without holding mm->mmap_sem in exit_mmap() and uses the > > > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we > > > don't get dozens of unnecessary oom kills. > > > > > > What do you think about this? It passes preliminary testing on powerpc > > > and I'm enqueued it for much more intensive testing. (I'm wishing there > > > was a better way to acknowledge your contribution to fixing this issue, > > > especially since you brought up the exact problem this is addressing in > > > previous emails.) > > > > > > > I don't think this patch is safe, for exit_mmap() is calling > > mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock > > held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap(). > > One of the reasons that I extracted __oom_reap_task_mm() out of the new > oom_reap_task_mm() is to avoid the checks that would be unnecessary when > called from exit_mmap(). In this case, we can ignore the > mm_has_blockable_invalidate_notifiers() check because exit_mmap() has > already done mmu_notifier_release(). 
So I don't think there's a concern > about __oom_reap_task_mm() blocking while holding oom_lock. Unless you > are referring to something else? Oh, mmu_notifier_release() made mm_has_blockable_invalidate_notifiers() == false. OK. But I want a comment explaining why it is safe; I will probably miss that dependency the next time we move that code. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 21:57 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Tetsuo Handa @ 2018-04-24 22:25 ` David Rientjes 2018-04-24 22:34 ` [patch v3 for-4.17] " David Rientjes 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-24 22:25 UTC (permalink / raw) To: Tetsuo Handa; +Cc: mhocko, akpm, aarcange, guro, linux-kernel, linux-mm On Wed, 25 Apr 2018, Tetsuo Handa wrote: > > One of the reasons that I extracted __oom_reap_task_mm() out of the new > > oom_reap_task_mm() is to avoid the checks that would be unnecessary when > > called from exit_mmap(). In this case, we can ignore the > > mm_has_blockable_invalidate_notifiers() check because exit_mmap() has > > already done mmu_notifier_release(). So I don't think there's a concern > > about __oom_reap_task_mm() blocking while holding oom_lock. Unless you > > are referring to something else? > > Oh, mmu_notifier_release() made mm_has_blockable_invalidate_notifiers() == false. OK. > > But I want comments why it is safe; I will probably miss that dependency > when we move that code next time. > Ok, makes sense. I'll send a v3 to update the comment. ^ permalink raw reply [flat|nested] 47+ messages in thread
* [patch v3 for-4.17] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 22:25 ` David Rientjes @ 2018-04-24 22:34 ` David Rientjes 2018-04-24 23:19 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-24 22:34 UTC (permalink / raw) To: Andrew Morton, Tetsuo Handa, Linus Torvalds Cc: mhocko, Andrea Arcangeli, guro, linux-kernel, linux-mm Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up dereferencing a NULL ptl, causing a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper cannot run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This fixes CVE-2018-1000200. 
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Cc: stable@vger.kernel.org [4.14+] Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/oom.h | 2 ++ mm/mmap.c | 44 ++++++++++++++---------- mm/oom_kill.c | 81 ++++++++++++++++++++++++--------------------- 3 files changed, 71 insertions(+), 56 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm) return 0; } +void __oom_reap_task_mm(struct mm_struct *mm); + extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3015,6 +3015,32 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); + if (unlikely(mm_is_oom_victim(mm))) { + /* + * Manually reap the mm to free as much memory as possible. + * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard + * this mm from further consideration. Taking mm->mmap_sem for + * write after setting MMF_OOM_SKIP will guarantee that the oom + * reaper will not run on this mm again after mmap_sem is + * dropped. + * + * Nothing can be holding mm->mmap_sem here and the above call + * to mmu_notifier_release(mm) ensures mmu notifier callbacks in + * __oom_reap_task_mm() will not block. + * + * This needs to be done before calling munlock_vma_pages_all(), + * which clears VM_LOCKED, otherwise the oom reaper cannot + * reliably test it. 
+ */ + mutex_lock(&oom_lock); + __oom_reap_task_mm(mm); + mutex_unlock(&oom_lock); + + set_bit(MMF_OOM_SKIP, &mm->flags); + down_write(&mm->mmap_sem); + up_write(&mm->mmap_sem); + } + if (mm->locked_vm) { vma = mm->mmap; while (vma) { @@ -3036,24 +3062,6 @@ void exit_mmap(struct mm_struct *mm) /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - - if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because victim->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if victim->mm - * is found not NULL while holding the task_lock. - */ - set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) return false; } - #ifdef CONFIG_MMU /* * OOM Reaper kernel thread which tries to reap the memory used by the OOM @@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); static struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); -static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) +void __oom_reap_task_mm(struct mm_struct *mm) { - struct mmu_gather tlb; struct vm_area_struct *vma; + + /* + * Tell all users of get_user/copy_from_user etc... that the content + * is no longer stable. 
No barriers really needed because unmapping + * should imply barriers already and the reader would hit a page fault + * if it stumbled over a reaped memory. + */ + set_bit(MMF_UNSTABLE, &mm->flags); + + for (vma = mm->mmap ; vma; vma = vma->vm_next) { + if (!can_madv_dontneed_vma(vma)) + continue; + + /* + * Only anonymous pages have a good chance to be dropped + * without additional steps which we cannot afford as we + * are OOM already. + * + * We do not even care about fs backed pages because all + * which are reclaimable have already been reclaimed and + * we do not want to block exit_mmap by keeping mm ref + * count elevated without a good reason. + */ + if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { + const unsigned long start = vma->vm_start; + const unsigned long end = vma->vm_end; + struct mmu_gather tlb; + + tlb_gather_mmu(&tlb, mm, start, end); + mmu_notifier_invalidate_range_start(mm, start, end); + unmap_page_range(&tlb, vma, start, end, NULL); + mmu_notifier_invalidate_range_end(mm, start, end); + tlb_finish_mmu(&tlb, start, end); + } + } +} + +static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) +{ bool ret = true; /* * We have to make sure to not race with the victim exit path * and cause premature new oom victim selection: - * __oom_reap_task_mm exit_mm + * oom_reap_task_mm exit_mm * mmget_not_zero * mmput * atomic_dec_and_test @@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) trace_start_task_reaping(tsk->pid); - /* - * Tell all users of get_user/copy_from_user etc... that the content - * is no longer stable. No barriers really needed because unmapping - * should imply barriers already and the reader would hit a page fault - * if it stumbled over a reaped memory. 
- */ - set_bit(MMF_UNSTABLE, &mm->flags); - - for (vma = mm->mmap ; vma; vma = vma->vm_next) { - if (!can_madv_dontneed_vma(vma)) - continue; + __oom_reap_task_mm(mm); - /* - * Only anonymous pages have a good chance to be dropped - * without additional steps which we cannot afford as we - * are OOM already. - * - * We do not even care about fs backed pages because all - * which are reclaimable have already been reclaimed and - * we do not want to block exit_mmap by keeping mm ref - * count elevated without a good reason. - */ - if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { - const unsigned long start = vma->vm_start; - const unsigned long end = vma->vm_end; - - tlb_gather_mmu(&tlb, mm, start, end); - mmu_notifier_invalidate_range_start(mm, start, end); - unmap_page_range(&tlb, vma, start, end, NULL); - mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); - } - } pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm, K(get_mm_counter(mm, MM_ANONPAGES)), @@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk) struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) + while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm)) schedule_timeout_idle(HZ/10); if (attempts <= MAX_OOM_REAP_RETRIES || test_bit(MMF_OOM_SKIP, &mm->flags)) goto done; - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); debug_show_all_locks(); ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v3 for-4.17] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 22:34 ` [patch v3 for-4.17] " David Rientjes @ 2018-04-24 23:19 ` Michal Hocko 0 siblings, 0 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-24 23:19 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Linus Torvalds, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue 24-04-18 15:34:03, David Rientjes wrote: > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > possible for the oom reaper to concurrently operate on an mm until > MMF_OOM_SKIP is set. > > This allows munlock_vma_pages_all() to concurrently run while the oom > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > clearing VM_LOCKED from vm_flags before actually doing the munlock to > determine if any other vmas are locking the same memory, the check for > VM_LOCKED in the oom reaper is racy. > > This is especially noticeable on architectures such as powerpc where > clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd > is zapped by the oom reaper during follow_page_mask() after the check for > pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel > oops. > > Fix this by manually freeing all possible memory from the mm before doing > the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on > the mm anymore so the munlock is safe to do in exit_mmap(). It also > matches the logic that the oom reaper currently uses for determining when > to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom > killing. > > This issue fixes CVE-2018-1000200. 
> > Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") > Cc: stable@vger.kernel.org [4.14+] > Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Signed-off-by: David Rientjes <rientjes@google.com> OK, now that I finally got that mmu_notifier_release will make all further mmu notifier calls NOOP then this is indeed safe. Considering that you take oom_lock, you can set MMF_OOM_SKIP inside that lock and won't have to bother with the exclusive mmap_sem AFAICS. So the patch can be simplified. But other than that this looks like a right way to go. I would have preferred to hide the oom locking and MMF_OOM_SKIP handling in exit_mmap but this is mostly cosmetic. Acked-by: Michal Hocko <mhocko@suse.com> > --- > include/linux/oom.h | 2 ++ > mm/mmap.c | 44 ++++++++++++++---------- > mm/oom_kill.c | 81 ++++++++++++++++++++++++--------------------- > 3 files changed, 71 insertions(+), 56 deletions(-) > > diff --git a/include/linux/oom.h b/include/linux/oom.h > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm) > return 0; > } > > +void __oom_reap_task_mm(struct mm_struct *mm); > + > extern unsigned long oom_badness(struct task_struct *p, > struct mem_cgroup *memcg, const nodemask_t *nodemask, > unsigned long totalpages); > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3015,6 +3015,32 @@ void exit_mmap(struct mm_struct *mm) > /* mm's last user has gone, and its about to be pulled down */ > mmu_notifier_release(mm); > > + if (unlikely(mm_is_oom_victim(mm))) { > + /* > + * Manually reap the mm to free as much memory as possible. > + * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard > + * this mm from further consideration. Taking mm->mmap_sem for > + * write after setting MMF_OOM_SKIP will guarantee that the oom > + * reaper will not run on this mm again after mmap_sem is > + * dropped. 
> + * > + * Nothing can be holding mm->mmap_sem here and the above call > + * to mmu_notifier_release(mm) ensures mmu notifier callbacks in > + * __oom_reap_task_mm() will not block. > + * > + * This needs to be done before calling munlock_vma_pages_all(), > + * which clears VM_LOCKED, otherwise the oom reaper cannot > + * reliably test it. > + */ > + mutex_lock(&oom_lock); > + __oom_reap_task_mm(mm); > + mutex_unlock(&oom_lock); > + > + set_bit(MMF_OOM_SKIP, &mm->flags); > + down_write(&mm->mmap_sem); > + up_write(&mm->mmap_sem); > + } > + > if (mm->locked_vm) { > vma = mm->mmap; > while (vma) { > @@ -3036,24 +3062,6 @@ void exit_mmap(struct mm_struct *mm) > /* update_hiwater_rss(mm) here? but nobody should be looking */ > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > - > - if (unlikely(mm_is_oom_victim(mm))) { > - /* > - * Wait for oom_reap_task() to stop working on this > - * mm. Because MMF_OOM_SKIP is already set before > - * calling down_read(), oom_reap_task() will not run > - * on this "mm" post up_write(). > - * > - * mm_is_oom_victim() cannot be set from under us > - * either because victim->mm is already set to NULL > - * under task_lock before calling mmput and oom_mm is > - * set not NULL by the OOM killer only if victim->mm > - * is found not NULL while holding the task_lock. 
> - */ > - set_bit(MMF_OOM_SKIP, &mm->flags); > - down_write(&mm->mmap_sem); > - up_write(&mm->mmap_sem); > - } > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > tlb_finish_mmu(&tlb, 0, -1); > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) > return false; > } > > - > #ifdef CONFIG_MMU > /* > * OOM Reaper kernel thread which tries to reap the memory used by the OOM > @@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); > static struct task_struct *oom_reaper_list; > static DEFINE_SPINLOCK(oom_reaper_lock); > > -static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > +void __oom_reap_task_mm(struct mm_struct *mm) > { > - struct mmu_gather tlb; > struct vm_area_struct *vma; > + > + /* > + * Tell all users of get_user/copy_from_user etc... that the content > + * is no longer stable. No barriers really needed because unmapping > + * should imply barriers already and the reader would hit a page fault > + * if it stumbled over a reaped memory. > + */ > + set_bit(MMF_UNSTABLE, &mm->flags); > + > + for (vma = mm->mmap ; vma; vma = vma->vm_next) { > + if (!can_madv_dontneed_vma(vma)) > + continue; > + > + /* > + * Only anonymous pages have a good chance to be dropped > + * without additional steps which we cannot afford as we > + * are OOM already. > + * > + * We do not even care about fs backed pages because all > + * which are reclaimable have already been reclaimed and > + * we do not want to block exit_mmap by keeping mm ref > + * count elevated without a good reason. 
> + */ > + if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { > + const unsigned long start = vma->vm_start; > + const unsigned long end = vma->vm_end; > + struct mmu_gather tlb; > + > + tlb_gather_mmu(&tlb, mm, start, end); > + mmu_notifier_invalidate_range_start(mm, start, end); > + unmap_page_range(&tlb, vma, start, end, NULL); > + mmu_notifier_invalidate_range_end(mm, start, end); > + tlb_finish_mmu(&tlb, start, end); > + } > + } > +} > + > +static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > +{ > bool ret = true; > > /* > * We have to make sure to not race with the victim exit path > * and cause premature new oom victim selection: > - * __oom_reap_task_mm exit_mm > + * oom_reap_task_mm exit_mm > * mmget_not_zero > * mmput > * atomic_dec_and_test > @@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > > trace_start_task_reaping(tsk->pid); > > - /* > - * Tell all users of get_user/copy_from_user etc... that the content > - * is no longer stable. No barriers really needed because unmapping > - * should imply barriers already and the reader would hit a page fault > - * if it stumbled over a reaped memory. > - */ > - set_bit(MMF_UNSTABLE, &mm->flags); > - > - for (vma = mm->mmap ; vma; vma = vma->vm_next) { > - if (!can_madv_dontneed_vma(vma)) > - continue; > + __oom_reap_task_mm(mm); > > - /* > - * Only anonymous pages have a good chance to be dropped > - * without additional steps which we cannot afford as we > - * are OOM already. > - * > - * We do not even care about fs backed pages because all > - * which are reclaimable have already been reclaimed and > - * we do not want to block exit_mmap by keeping mm ref > - * count elevated without a good reason. 
> - */ > - if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { > - const unsigned long start = vma->vm_start; > - const unsigned long end = vma->vm_end; > - > - tlb_gather_mmu(&tlb, mm, start, end); > - mmu_notifier_invalidate_range_start(mm, start, end); > - unmap_page_range(&tlb, vma, start, end, NULL); > - mmu_notifier_invalidate_range_end(mm, start, end); > - tlb_finish_mmu(&tlb, start, end); > - } > - } > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > task_pid_nr(tsk), tsk->comm, > K(get_mm_counter(mm, MM_ANONPAGES)), > @@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk) > struct mm_struct *mm = tsk->signal->oom_mm; > > /* Retry the down_read_trylock(mmap_sem) a few times */ > - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) > + while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm)) > schedule_timeout_idle(HZ/10); > > if (attempts <= MAX_OOM_REAP_RETRIES || > test_bit(MMF_OOM_SKIP, &mm->flags)) > goto done; > > - > pr_info("oom_reaper: unable to reap pid:%d (%s)\n", > task_pid_nr(tsk), tsk->comm); > debug_show_all_locks(); -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 2:31 ` David Rientjes 2018-04-24 5:11 ` Tetsuo Handa @ 2018-04-24 13:04 ` Michal Hocko 2018-04-24 20:01 ` David Rientjes 1 sibling, 1 reply; 47+ messages in thread From: Michal Hocko @ 2018-04-24 13:04 UTC (permalink / raw) To: David Rientjes Cc: Tetsuo Handa, Andrew Morton, Andrea Arcangeli, guro, linux-kernel, linux-mm On Mon 23-04-18 19:31:05, David Rientjes wrote: [...] > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3015,6 +3015,27 @@ void exit_mmap(struct mm_struct *mm) > /* mm's last user has gone, and its about to be pulled down */ > mmu_notifier_release(mm); > > + if (unlikely(mm_is_oom_victim(mm))) { > + /* > + * Manually reap the mm to free as much memory as possible. > + * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this > + * mm from further consideration. Taking mm->mmap_sem for write > + * after setting MMF_OOM_SKIP will guarantee that the oom reaper > + * will not run on this mm again after mmap_sem is dropped. > + * > + * This needs to be done before calling munlock_vma_pages_all(), > + * which clears VM_LOCKED, otherwise the oom reaper cannot > + * reliably test it. > + */ > + mutex_lock(&oom_lock); > + __oom_reap_task_mm(mm); > + mutex_unlock(&oom_lock); > + > + set_bit(MMF_OOM_SKIP, &mm->flags); > + down_write(&mm->mmap_sem); > + up_write(&mm->mmap_sem); > + } > + Is there any reason why we cannot simply call __oom_reap_task_mm as we have it now? mmap_sem for read shouldn't fail here because this is the last reference of the mm and we are past the ksm and khugepaged synchronizations. So unless my jet-lagged brain fools me the patch should be as simple as the following (I haven't tested it at all).
diff --git a/mm/mmap.c b/mm/mmap.c index faf85699f1a1..a8f170f53872 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3008,6 +3008,13 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); + /* + * The mm is not accessible for anybody except for the oom reaper + * which cannot race with munlocking so reap the task direct. + */ + if (unlikely(mm_is_oom_victim(mm))) + __oom_reap_task_mm(current, mm); + if (mm->locked_vm) { vma = mm->mmap; while (vma) { @@ -3030,23 +3037,6 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because victim->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if victim->mm - * is found not NULL while holding the task_lock. - */ - set_bit(MMF_OOM_SKIP, &mm->flags); - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); @@ -3060,6 +3050,7 @@ void exit_mmap(struct mm_struct *mm) vma = remove_vma(vma); } vm_unacct_memory(nr_accounted); + } /* Insert vm structure into process list sorted by address diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dfd370526909..e39ceb127e8e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -524,7 +524,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't * work on the mm anymore. 
The check for MMF_OOM_SKIP must run * under mmap_sem for reading because it serializes against the - * down_write();up_write() cycle in exit_mmap(). + * exit_mmap(). */ if (test_bit(MMF_OOM_SKIP, &mm->flags)) { up_read(&mm->mmap_sem); @@ -567,12 +567,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) tlb_finish_mmu(&tlb, start, end); } } - pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", + pr_info("%s: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", + current->comm, task_pid_nr(tsk), tsk->comm, K(get_mm_counter(mm, MM_ANONPAGES)), K(get_mm_counter(mm, MM_FILEPAGES)), K(get_mm_counter(mm, MM_SHMEMPAGES))); up_read(&mm->mmap_sem); + set_bit(MMF_OOM_SKIP, &mm->flags); trace_finish_task_reaping(tsk->pid); unlock_oom: @@ -590,10 +592,11 @@ static void oom_reap_task(struct task_struct *tsk) while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) schedule_timeout_idle(HZ/10); - if (attempts <= MAX_OOM_REAP_RETRIES || - test_bit(MMF_OOM_SKIP, &mm->flags)) + if (attempts <= MAX_OOM_REAP_RETRIES) goto done; + if (test_bit(MMF_OOM_SKIP, &mm->flags)) + goto put_task; pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); @@ -609,6 +612,7 @@ static void oom_reap_task(struct task_struct *tsk) set_bit(MMF_OOM_SKIP, &mm->flags); /* Drop a reference taken by wake_oom_reaper */ +put_task: put_task_struct(tsk); } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 47+ messages in thread
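The retry logic in `oom_reap_task()` shown above — a bounded number of trylock attempts, then give up and set MMF_OOM_SKIP so the OOM killer can move on — can be modelled in isolation. A hypothetical, self-contained sketch with the trylock/reap step abstracted into a callback (`struct mm_model` and both callback names are invented, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OOM_REAP_RETRIES 10

/* Toy stand-in for mm_struct: only the skip flag matters here. */
struct mm_model { bool oom_skip; };

/* Models oom_reap_task(): retry while the reap attempt (here a
 * callback standing in for down_read_trylock + unmap) fails, then
 * either stop because reaping succeeded, or give up and set the skip
 * flag ourselves so the victim stops blocking further OOM decisions.
 * Returns the number of attempts made. */
static int oom_reap_task_model(struct mm_model *mm,
                               bool (*try_reap)(struct mm_model *))
{
    int attempts = 0;

    while (attempts++ < MAX_OOM_REAP_RETRIES && !try_reap(mm))
        ;                            /* schedule_timeout_idle(HZ/10) */

    if (attempts <= MAX_OOM_REAP_RETRIES || mm->oom_skip)
        return attempts;             /* reaped, or exit path got there */

    mm->oom_skip = true;             /* give up: unblock the OOM killer */
    return attempts;
}

/* Two stub behaviours for the lock: never available / free at once. */
static bool always_busy(struct mm_model *mm) { (void)mm; return false; }
static bool first_try(struct mm_model *mm)  { (void)mm; return true; }
```

With `always_busy` the loop runs out of retries and the flag is set by the reaper itself; with `first_try` a single attempt suffices and the flag is left for the reap path to manage.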
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 13:04 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Michal Hocko @ 2018-04-24 20:01 ` David Rientjes 2018-04-24 20:13 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-24 20:01 UTC (permalink / raw) To: Michal Hocko, Andrew Morton Cc: Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue, 24 Apr 2018, Michal Hocko wrote: > Is there any reason why we cannot simply call __oom_reap_task_mm as we > have it now? mmap_sem for read shouldn't fail here because this is the > last reference of the mm and we are past the ksm and khugepaged > synchronizations. So unless my jet-lagged brain fools me the patch should > be as simple as the following (I haven't tested it at all). > I wanted to remove all per-task checks because they are now irrelevant: this would be the first dependency that exit_mmap() has on any task_struct, which isn't intuitive -- we simply want to exit the mmap. There's no requirement that current owns the mm other than this. I wanted to avoid the implicit dependency on MMF_OOM_SKIP and make it explicit in the exit path to be matched with the oom reaper. I didn't want anything additional printed to the kernel log about oom reaping unless the oom_reaper actually needed to intervene, which is useful knowledge outside of basic exiting. My patch has passed intensive testing on both x86 and powerpc, so I'll ask that it's pushed for 4.17-rc3. Many thanks to Tetsuo for the suggestion on calling __oom_reap_task_mm() from exit_mmap(). ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 20:01 ` David Rientjes @ 2018-04-24 20:13 ` Michal Hocko 2018-04-24 20:22 ` David Rientjes 0 siblings, 1 reply; 47+ messages in thread From: Michal Hocko @ 2018-04-24 20:13 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue 24-04-18 13:01:03, David Rientjes wrote: > On Tue, 24 Apr 2018, Michal Hocko wrote: > > > Is there any reason why we cannot simply call __oom_reap_task_mm as we > > have it now? mmap_sem for read shouldn't fail here because this is the > > last reference of the mm and we are past the ksm and khugepaged > > synchronizations. So unless my jet-lagged brain fools me the patch should > > be as simple as the following (I haven't tested it at all). > > > > I wanted to remove all per-task checks because they are now irrelevant: > this would be the first dependency that exit_mmap() has on any > task_struct, which isn't intuitive -- we simply want to exit the mmap. > There's no requirement that current owns the mm other than this. There is no such requirement in __oom_reap_task_mm. The given task is used for reporting purposes. > I wanted > to avoid the implicit dependency on MMF_OOM_SKIP and make it explicit in > the exit path to be matched with the oom reaper. Well, I find it actually better that the code is not explicit about MMF_OOM_SKIP. The whole thing happens in the oom proper, which is really preferable. The whole synchronization is then completely transparent to the oom (including the oom lock etc.). > I didn't want anything > additional printed to the kernel log about oom reaping unless the > oom_reaper actually needed to intervene, which is useful knowledge outside > of basic exiting. Can we shave all those parts off as follow-ups and make the fix as simple as possible? > My patch has passed intensive testing on both x86 and powerpc, so I'll ask > that it's pushed for 4.17-rc3.
Many thanks to Tetsuo for the suggestion > on calling __oom_reap_task_mm() from exit_mmap(). Yeah, but your patch does have a problem with blockable mmu notifiers IIUC. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 20:13 ` Michal Hocko @ 2018-04-24 20:22 ` David Rientjes 2018-04-24 20:31 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-24 20:22 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue, 24 Apr 2018, Michal Hocko wrote: > > I wanted to remove all per task checks because they are now irrelevant: > > this would be the first dependency that exit_mmap() has on any > > task_struct, which isn't intuitive -- we simply want to exit the mmap. > > There's no requirement that current owns the mm other than this. > > There is no such requirement in the __oom_reap_task_mm. The given task > is used for reporting purposes. > And tracing, which is pointless. And it unnecessarily spams the kernel log for basic exiting. > > I wanted > > to avoid the implicit dependency on MMF_OOM_SKIP and make it explicit in > > the exit path to be matched with the oom reaper. > > Well, I find it actually better that the code is not explicit about > MMF_OOM_SKIP. The whole thing happens in the oom proper which should be > really preferable. The whole synchronization is then completely > transparent to the oom (including the oom lock etc). > It's already done in exit_mmap(); I'm not changing that. > > I didn't want anything > > additional printed to the kernel log about oom reaping unless the > > oom_reaper actually needed to intervene, which is useful knowledge outside > > of basic exiting. > > Can we shave all those parts as follow ups and make the fix as simple as > possible? > It is as simple as possible. It is not doing any unnecessary locking or checks that the exit path does not need to do for the sake of a smaller patch.
The number of changed lines in the patch is not what I'm interested in, I am interested in something that is stable, something that works, doesn't add additional (and unnecessary locking), and doesn't change around what function sets what bit when called from what path. > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask > > that it's pushed for 4.17-rc3. Many thanks to Tetsuo for the suggestion > > on calling __oom_reap_task_mm() from exit_mmap(). > > Yeah, but your patch does have a problem with blockable mmu notifiers > IIUC. What on earth are you talking about? exit_mmap() does mmu_notifier_release(). There are no blockable mmu notifiers. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 20:22 ` David Rientjes @ 2018-04-24 20:31 ` Michal Hocko 2018-04-24 21:07 ` David Rientjes 0 siblings, 1 reply; 47+ messages in thread From: Michal Hocko @ 2018-04-24 20:31 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue 24-04-18 13:22:45, David Rientjes wrote: [...] > > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask > > > that it's pushed for 4.17-rc3. Many thanks to Tetsuo for the suggestion > > > on calling __oom_reap_task_mm() from exit_mmap(). > > > > Yeah, but your patch does have a problem with blockable mmu notifiers > > IIUC. > > What on earth are you talking about? exit_mmap() does > mmu_notifier_release(). There are no blockable mmu notifiers. MMF_OOM_SKIP - remember? The thing that guarantees forward progress. So we cannot really depend on setting MMF_OOM_SKIP if a mmu_notifier_release blocks for an excessive/unbounded amount of time. Look, I am not really interested in discussing this to death, but it would be really _nice_ if you could calm down a bit, stop fighting for the solution you have proposed, and stop ignoring the feedback you are getting. There are two things to care about here: stop the race that can blow up, and do not regress the MMF_OOM_SKIP guarantee. Can we please do that. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap 2018-04-24 20:31 ` Michal Hocko @ 2018-04-24 21:07 ` David Rientjes 2018-04-24 23:08 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-24 21:07 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue, 24 Apr 2018, Michal Hocko wrote: > > > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask > > > > that it's pushed for 4.17-rc3. Many thanks to Tetsuo for the suggestion > > > > on calling __oom_reap_task_mm() from exit_mmap(). > > > > > > Yeah, but your patch does have a problem with blockable mmu notifiers > > > IIUC. > > > > What on earth are you talking about? exit_mmap() does > > mmu_notifier_release(). There are no blockable mmu notifiers. > > MMF_OOM_SKIP - remember? The thing that guarantees a forward progress. > So we cannot really depend on setting MMF_OOM_SKIP if a > mmu_notifier_release blocks for an excessive/unbounded amount of time. > If the thread is blocked in exit_mmap() because of mmu_notifier_release() then the oom reaper will eventually grab mm->mmap_sem (nothing holding it in exit_mmap()), return true, and oom_reap_task() will set MMF_OOM_SKIP. This is unchanged with the patch and is a completely separate issue. > Look I am not really interested in disussing this to death but it would > be really _nice_ if you could calm down a bit, stop fighting for the solution > you have proposed and ignore the feedback you are getting. > I assume we should spend more time considering the two untested patches you have sent, one of which killed 17 processes while a 8GB memory hog was exiting because the oom reaper couldn't grab mm->mmap_sem and set MMF_OOM_SKIP. > There are two things to care about here. Stop the race that can blow up > and do not regress MMF_OOM_SKIP guarantee. Can we please do that. My patch does both. Thanks. 
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-24 20:31 ` Michal Hocko @ 2018-04-24 21:07 ` David Rientjes 2018-04-24 23:08 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-24 21:07 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue, 24 Apr 2018, Michal Hocko wrote: > > > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask > > > > that it's pushed for 4.17-rc3. Many thanks to Tetsuo for the suggestion > > > > on calling __oom_reap_task_mm() from exit_mmap(). > > > > > > Yeah, but your patch does have a problem with blockable mmu notifiers > > > IIUC. > > > > What on earth are you talking about? exit_mmap() does > > mmu_notifier_release(). There are no blockable mmu notifiers. > > MMF_OOM_SKIP - remember? The thing that guarantees a forward progress. > So we cannot really depend on setting MMF_OOM_SKIP if a > mmu_notifier_release blocks for an excessive/unbounded amount of time. > If the thread is blocked in exit_mmap() because of mmu_notifier_release() then the oom reaper will eventually grab mm->mmap_sem (nothing holding it in exit_mmap()), return true, and oom_reap_task() will set MMF_OOM_SKIP. This is unchanged with the patch and is a completely separate issue. I must be missing something or we are talking past each other. So let me be explicit. What does prevent the following?

    oom_reaper                          exit_mmap
      mutex_lock(oom_lock)                mutex_lock(oom_lock)
                                          __oom_reap_task_mm
                                            mmu_notifier_invalidate_range_start
                                              # blockable mmu_notifier
                                              # which takes ages to
                                              # finish or depends on
                                              # an allocation (in)directly

-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
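The worry in the diagram above is that while the exit path sits inside a blockable notifier with oom_lock held, the reaper cannot take the lock and therefore cannot guarantee forward progress toward setting MMF_OOM_SKIP. A tiny userspace sketch of that blocking relationship using `pthread_mutex_trylock` — the function name is invented and this is not kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the global oom_lock. */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the reaper's side: it can only make progress (and eventually
 * set MMF_OOM_SKIP) if oom_lock is free. Returns 0 on success, or
 * EBUSY while the exit path still holds the lock. */
static int reaper_try_enter(void)
{
    int ret = pthread_mutex_trylock(&oom_lock);

    if (ret == 0)
        pthread_mutex_unlock(&oom_lock);   /* would reap here */
    return ret;
}
```

POSIX guarantees `pthread_mutex_trylock` returns EBUSY on an already-locked default mutex instead of blocking, which makes the "reaper stalled behind a held oom_lock" situation directly observable.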
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap 2018-04-24 23:08 ` Michal Hocko @ 2018-04-24 23:14 ` Michal Hocko 0 siblings, 0 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-24 23:14 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm On Tue 24-04-18 17:08:15, Michal Hocko wrote: > On Tue 24-04-18 14:07:52, David Rientjes wrote: > > On Tue, 24 Apr 2018, Michal Hocko wrote: > > > > > > > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask > > > > > > that it's pushed for 4.17-rc3. Many thanks to Tetsuo for the suggestion > > > > > > on calling __oom_reap_task_mm() from exit_mmap(). > > > > > > > > > > Yeah, but your patch does have a problem with blockable mmu notifiers > > > > > IIUC. > > > > > > > > What on earth are you talking about? exit_mmap() does > > > > mmu_notifier_release(). There are no blockable mmu notifiers. > > > > > > MMF_OOM_SKIP - remember? The thing that guarantees a forward progress. > > > So we cannot really depend on setting MMF_OOM_SKIP if a > > > mmu_notifier_release blocks for an excessive/unbounded amount of time. > > > > > > > If the thread is blocked in exit_mmap() because of mmu_notifier_release() > > then the oom reaper will eventually grab mm->mmap_sem (nothing holding it > > in exit_mmap()), return true, and oom_reap_task() will set MMF_OOM_SKIP. > > This is unchanged with the patch and is a completely separate issue. > > I must be missing something or we are talking past each other. Ohh. Ok, so _I_ was missing that mm_has_notifiers is false after mmu_notifier_release. So we cannot block at that time. Then we are good. Sorry about the confusion! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
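The point conceded above is that `mmu_notifier_release()` detaches all notifiers, so `mm_has_notifiers()` is false afterwards and later invalidation calls return immediately instead of blocking. A toy model of that guard, with invented names standing in for the kernel structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy mm with a single (optional) notifier. */
struct mm {
    bool has_notifiers;     /* models mm_has_notifiers() */
    int invalidations;      /* how often a real callback ran */
};

/* Models mmu_notifier_release(): after this, no callback can run. */
static void mmu_notifier_release_model(struct mm *mm)
{
    mm->has_notifiers = false;
}

/* Models mmu_notifier_invalidate_range_start(): a no-op once the
 * notifiers are gone, so it can no longer block. */
static void invalidate_range_start_model(struct mm *mm)
{
    if (!mm->has_notifiers)
        return;             /* released: nothing to call into */
    mm->invalidations++;    /* would invoke the (blockable) callback */
}
```

Since `exit_mmap()` calls the release step before reaping, any invalidation issued afterwards in this model hits the early return, which is why the earlier "blockable notifier" objection evaporates.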
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-20 8:23 ` Michal Hocko 2018-04-20 12:40 ` Michal Hocko @ 2018-04-22 3:45 ` David Rientjes 2018-04-22 13:18 ` Michal Hocko 1 sibling, 1 reply; 47+ messages in thread From: David Rientjes @ 2018-04-22 3:45 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Fri, 20 Apr 2018, Michal Hocko wrote: > > The solution is certainly not to hold > > down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead. > > Why not? This is what we do for normal paths. exit path just tries to be > clever because it knows that it doesn't have to lock because there is no > concurent user. At least it wasn't until the oom reaper came. So I > really fail to see why we shouldn't do the most obvious thing and use > the locking. > Because the oom reaper uses the ability to acquire mm->mmap_sem as a heuristic to determine when to give up and allow another process to be chosen. If the heuristics of the oom reaper are to be improved, that's great, but this patch fixes the oops on powerpc as 4.17 material and as a stable backport. It's also well tested. > > If > > exit_mmap() is not making forward progress then that's a separate issue; > > Please read what I wrote. There is a page lock and there is no way to > guarantee it will make a forward progress. Or do you consider that not > true? > I don't have any evidence of it, and since this is called before exit_mmap() sets MMF_OOM_SKIP then the oom reaper would need to set the bit itself and I would be able to find the artifact it leaves behind in the kernel log. I cannot find a single instance of a thread stuck in munlock by way of exit_mmap() that causes the oom reaper to have to set the bit itself, and I should be able to if this were a problem. 
> > Holding down_write on > > mm->mmap_sem otherwise needlessly over a large amount of code is riskier > > (hasn't been done or tested here), more error prone (any code change over > > this large area of code or in functions it calls are unnecessarily > > burdened by unnecessary locking), makes exit_mmap() less extensible for > > the same reason, > > I do not see any of the calls in that path could suffer from holding > mmap_sem. Do you? > > > and causes the oom reaper to give up and go set > > MMF_OOM_SKIP itself because it depends on taking down_read while the > > thread is still exiting. > > Which is the standard backoff mechanism. > The reason today's locking methodology is preferred is because of the backoff mechanism. Your patch kills many processes unnecessarily if the oom killer selects a large process to kill, which it specifically tries to do, because unmap_vmas() and free_pgtables() takes a very long time, sometimes tens of seconds. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-22 3:45 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes @ 2018-04-22 13:18 ` Michal Hocko 2018-04-23 16:09 ` Michal Hocko 0 siblings, 1 reply; 47+ messages in thread From: Michal Hocko @ 2018-04-22 13:18 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Sat 21-04-18 20:45:11, David Rientjes wrote: > On Fri, 20 Apr 2018, Michal Hocko wrote: > > > > The solution is certainly not to hold > > > down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead. > > > > Why not? This is what we do for normal paths. exit path just tries to be > > clever because it knows that it doesn't have to lock because there is no > > concurent user. At least it wasn't until the oom reaper came. So I > > really fail to see why we shouldn't do the most obvious thing and use > > the locking. > > > > Because the oom reaper uses the ability to acquire mm->mmap_sem as a > heuristic to determine when to give up and allow another process to be > chosen. > > If the heuristics of the oom reaper are to be improved, that's great, but > this patch fixes the oops on powerpc as 4.17 material and as a stable > backport. It's also well tested. > > > > If > > > exit_mmap() is not making forward progress then that's a separate issue; > > > > Please read what I wrote. There is a page lock and there is no way to > > guarantee it will make a forward progress. Or do you consider that not > > true? > > > > I don't have any evidence of it, and since this is called before > exit_mmap() sets MMF_OOM_SKIP then the oom reaper would need to set the > bit itself and I would be able to find the artifact it leaves behind in > the kernel log. 
I cannot find a single instance of a thread stuck in > munlock by way of exit_mmap() that causes the oom reaper to have to set > the bit itself, and I should be able to if this were a problem. Look. The fact that you do not _see any evidence_ is completely irrelevant. The OOM reaper is about a _guarantee_. And the guarantee is gone with the page_lock because that is used in contexts which do allocate memory and it can depend on other locks. So there is _no way_ we can make MMF_OOM_SKIP depend on it. I will not repeat it anymore. I will not allow the whole oom reaper endeavor to be ruined by adding "this should not happen" stuff that the oom killer was full of. > > > Holding down_write on > > > mm->mmap_sem otherwise needlessly over a large amount of code is riskier > > > (hasn't been done or tested here), more error prone (any code change over > > > this large area of code or in functions it calls are unnecessarily > > > burdened by unnecessary locking), makes exit_mmap() less extensible for > > > the same reason, > > > > I do not see any of the calls in that path could suffer from holding > > mmap_sem. Do you? > > > > > and causes the oom reaper to give up and go set > > > MMF_OOM_SKIP itself because it depends on taking down_read while the > > > thread is still exiting. > > > > Which is the standard backoff mechanism. > > > > The reason today's locking methodology is preferred is because of the > backoff mechanism. Your patch kills many processes unnecessarily if the > oom killer selects a large process to kill, which it specifically tries to > do, because unmap_vmas() and free_pgtables() takes a very long time, > sometimes tens of seconds. and I absolutely agree that the feedback mechanism should be improved. The patch I propose _might_ lead to killing another task. I do not pretend otherwise. But it will keep the lockup-free guarantee which the oom reaper is giving us. Btw.
the past oom implementation would simply kill more in that case as well because exiting tasks with task->mm == NULL would be ignored completely. So this is not a big regression even if that happens occasionally. Maybe invoking the reaper as suggested by Tetsuo will help here. Maybe we will come up with something more smart. But I would like to have a stop gap solution for stable that is easy enough. And your patch is not doing that because it adds a very subtle dependency on the page lock. So please stop repeating your arguments all over and either come with an argument which proves me wrong and the lock_page dependency is not real or come with an alternative solution which doesn't make MMF_OOM_SKIP depend on the page lock. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap 2018-04-22 13:18 ` Michal Hocko @ 2018-04-23 16:09 ` Michal Hocko 0 siblings, 0 replies; 47+ messages in thread From: Michal Hocko @ 2018-04-23 16:09 UTC (permalink / raw) To: David Rientjes, Tetsuo Handa Cc: Andrew Morton, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm On Sun 22-04-18 07:18:57, Michal Hocko wrote: > On Sat 21-04-18 20:45:11, David Rientjes wrote: [...] > Maybe invoking the reaper as suggested by Tetsuo will help here. Maybe > we will come up with something more smart. But I would like to have a > stop gap solution for stable that is easy enough. And your patch is not > doing that because it adds a very subtle dependency on the page lock. > So please stop repeating your arguments all over and either come with > an argument which proves me wrong and the lock_page dependency is not > real or come with an alternative solution which doesn't make > MMF_OOM_SKIP depend on the page lock. I though I would give this a try but I am at a conference and quite busy. Tetsuo are you willing to give it a try so that we have something to compare and decide, please? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads: [~2018-04-24 23:19 UTC | newest]

Thread overview: 47+ messages -- links below jump to the message on this page --
2018-04-17 22:46 [patch] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
2018-04-18  0:57 ` Tetsuo Handa
2018-04-18  2:39   ` David Rientjes
2018-04-18  2:52     ` [patch v2] " David Rientjes
2018-04-18  3:55       ` Tetsuo Handa
2018-04-18  4:11         ` David Rientjes
2018-04-18  4:47           ` Tetsuo Handa
2018-04-18  5:20             ` David Rientjes
2018-04-18  7:50       ` Michal Hocko
2018-04-18 11:49         ` Tetsuo Handa
2018-04-18 11:58           ` Michal Hocko
2018-04-18 13:25             ` Tetsuo Handa
2018-04-18 13:44               ` Michal Hocko
2018-04-18 14:28                 ` Tetsuo Handa
2018-04-18 19:14         ` David Rientjes
2018-04-19  6:35           ` Michal Hocko
2018-04-19 10:45             ` Tetsuo Handa
2018-04-19 11:04               ` Michal Hocko
2018-04-19 11:51                 ` Tetsuo Handa
2018-04-19 12:48                   ` Michal Hocko
2018-04-19 19:14             ` David Rientjes
2018-04-19 19:34               ` David Rientjes
2018-04-19 22:13                 ` Tetsuo Handa
2018-04-20  8:23                   ` Michal Hocko
2018-04-20 12:40                     ` Michal Hocko
2018-04-20 12:40                       ` Michal Hocko
2018-04-22  3:22                         ` David Rientjes
2018-04-22  3:48                           ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa
2018-04-22 13:08                             ` Michal Hocko
2018-04-24  2:31                               ` David Rientjes
2018-04-24  5:11                                 ` Tetsuo Handa
2018-04-24  5:35                                   ` David Rientjes
2018-04-24 21:57                                     ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Tetsuo Handa
2018-04-24 22:25                                       ` David Rientjes
2018-04-24 22:34                                         ` [patch v3 for-4.17] " David Rientjes
2018-04-24 23:19                                           ` Michal Hocko
2018-04-24 13:04                               ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Michal Hocko
2018-04-24 20:01                                 ` David Rientjes
2018-04-24 20:13                                   ` Michal Hocko
2018-04-24 20:22                                     ` David Rientjes
2018-04-24 20:31                                       ` Michal Hocko
2018-04-24 21:07                                         ` David Rientjes
2018-04-24 23:08                                           ` Michal Hocko
2018-04-24 23:14                                             ` Michal Hocko
2018-04-22  3:45                         ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
2018-04-22 13:18                           ` Michal Hocko
2018-04-23 16:09                             ` Michal Hocko