* Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-06  2:43 David Rientjes
From: David Rientjes
To: Andrew Morton
Cc: Michal Hocko, Tetsuo Handa, Andrea Arcangeli, linux-kernel, linux-mm

Hi,

I'd like to understand the synchronization between the oom_reaper's
unmap_page_range() and exit_mmap().  The latter does not hold
mm->mmap_sem: it is supposed to be the last thread operating on the mm
before it is destroyed.  If unmap_page_range() races with unmap_vmas(),
we trivially call page_remove_rmap() twice on the same page:

BUG: Bad page map in process oom_reaper  pte:6353826300000000 pmd:00000000
addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping:          (null) index:7f50cab1d
file:          (null) fault:          (null) mmap:          (null) readpage:          (null)
CPU: 2 PID: 1001 Comm: oom_reaper
Call Trace:
 [<ffffffffa4bd967d>] dump_stack+0x4d/0x70
 [<ffffffffa4a03558>] unmap_page_range+0x1068/0x1130
 [<ffffffffa4a2e07f>] __oom_reap_task_mm+0xd5/0x16b
 [<ffffffffa4a2e226>] oom_reaper+0xff/0x14c
 [<ffffffffa48d6ad1>] kthread+0xc1/0xe0

And there are more examples of badness from an unmap_page_range() racing
with unmap_vmas().

In this case, MMF_OOM_SKIP is doing two things: (1) avoiding additional
oom kills until unmap_vmas() returns, and (2) preventing the oom_reaper
from working on the mm after unmap_vmas().  In (2), there is nothing
preventing the oom reaper from calling unmap_page_range() in parallel
with the final thread doing unmap_vmas() -- we no longer do mmget() to
prevent exit_mmap() from being called.

I don't think that we can allow the oom reaper's unmap_page_range() to
race with unmap_vmas().  If we can, what allows this if we don't either
increment mm->mm_users in the oom reaper or hold mm->mmap_sem for write
in exit_mmap()?
One way to solve the issue is to have two mm flags: one to indicate the
mm is entering unmap_vmas(): set the flag, do down_write(&mm->mmap_sem);
up_write(&mm->mmap_sem), then unmap_vmas().  The oom reaper needs this
flag clear, not MMF_OOM_SKIP, while holding down_read(&mm->mmap_sem) to
be allowed to call unmap_page_range().  The oom killer will still defer
selecting this victim for MMF_OOM_SKIP after unmap_vmas() returns.

The result of that change would be that we do not oom reap from any mm
entering unmap_vmas(): we let unmap_vmas() do the work itself and avoid
racing with it.
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-06  2:58 David Rientjes
From: David Rientjes
To: Andrew Morton
Cc: Michal Hocko, Tetsuo Handa, Andrea Arcangeli, linux-kernel, linux-mm

On Tue, 5 Dec 2017, David Rientjes wrote:

> One way to solve the issue is to have two mm flags: one to indicate the mm
> is entering unmap_vmas(): set the flag, do down_write(&mm->mmap_sem);
> up_write(&mm->mmap_sem), then unmap_vmas().  The oom reaper needs this
> flag clear, not MMF_OOM_SKIP, while holding down_read(&mm->mmap_sem) to be
> allowed to call unmap_page_range().  The oom killer will still defer
> selecting this victim for MMF_OOM_SKIP after unmap_vmas() returns.
>
> The result of that change would be that we do not oom reap from any mm
> entering unmap_vmas(): we let unmap_vmas() do the work itself and avoid
> racing with it.

I think we need something like the following?
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -70,6 +70,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE	23	/* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
+#define MMF_REAPING		25	/* mm is undergoing reaping */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3014,16 +3014,11 @@ void exit_mmap(struct mm_struct *mm)
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb_gather_mmu(&tlb, mm, 0, -1);
-	/* update_hiwater_rss(mm) here? but nobody should be looking */
-	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	unmap_vmas(&tlb, vma, 0, -1);
-
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	set_bit(MMF_REAPING, &mm->flags);
 	if (unlikely(tsk_is_oom_victim(current))) {
 		/*
 		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
+		 * mm. Because MMF_REAPING is already set before
 		 * calling down_read(), oom_reap_task() will not run
 		 * on this "mm" post up_write().
 		 *
@@ -3036,6 +3031,11 @@ void exit_mmap(struct mm_struct *mm)
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	/* update_hiwater_rss(mm) here? but nobody should be looking */
+	/* Use -1 here to ensure all VMAs in the mm are unmapped */
+	unmap_vmas(&tlb, vma, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -529,12 +529,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * MMF_REAPING is set by exit_mmap when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_REAPING must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (test_bit(MMF_REAPING, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
 		goto unlock_oom;
	}
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-06  7:48 David Rientjes
From: David Rientjes
To: Tetsuo Handa
Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, linux-kernel, linux-mm

On Wed, 6 Dec 2017, Tetsuo Handa wrote:

> > > One way to solve the issue is to have two mm flags: one to indicate the mm
> > > is entering unmap_vmas(): set the flag, do down_write(&mm->mmap_sem);
> > > up_write(&mm->mmap_sem), then unmap_vmas().  The oom reaper needs this
> > > flag clear, not MMF_OOM_SKIP, while holding down_read(&mm->mmap_sem) to be
> > > allowed to call unmap_page_range().  The oom killer will still defer
> > > selecting this victim for MMF_OOM_SKIP after unmap_vmas() returns.
> > >
> > > The result of that change would be that we do not oom reap from any mm
> > > entering unmap_vmas(): we let unmap_vmas() do the work itself and avoid
> > > racing with it.
> >
> > I think we need something like the following?
>
> This patch does not work.  __oom_reap_task_mm() can find MMF_REAPING,
> return true, and set MMF_OOM_SKIP before exit_mmap() calls down_write().

Ah, you're talking about oom_reap_task() setting MMF_OOM_SKIP prematurely
and allowing for additional oom kills?  I see your point, but I was mainly
focused on preventing the panic as the first order of business.  We could
certainly fix oom_reap_task() to not set MMF_OOM_SKIP itself and rather
leave that to exit_mmap() if it finds MMF_REAPING, if your concern matches
my understanding.

> Also, I don't know what exit_mmap() is doing but I think that there is a
> possibility that the OOM reaper tries to reclaim mlocked pages as soon as
> exit_mmap() cleared the VM_LOCKED flag by calling munlock_vma_pages_all().
>
>	if (mm->locked_vm) {
>		vma = mm->mmap;
>		while (vma) {
>			if (vma->vm_flags & VM_LOCKED)
>				munlock_vma_pages_all(vma);
>			vma = vma->vm_next;
>		}
>	}
>

Yes, that looks possible as well, although the problem I have reported can
happen with or without mlock.  Did you find this by code inspection or
have you experienced runtime problems with it?

I think this argues for doing MMF_REAPING-style behavior at the beginning
of exit_mmap() and avoiding reaping altogether once we have reached that
point.  There are no more users of the mm and we are in the process of
tearing it down; I'm not sure that the oom reaper should be in the
business of trying to interfere with that.  Or are there actual bug
reports where an oom victim gets wedged while in exit_mmap() prior to
releasing its memory?

I know this conflicts with your patches in -mm to remove the oom mutex,
but I think we should make sure we can prevent crashes before cleaning it
up.

(I also noticed that the mm_has_notifiers() check is missing a
trace_skip_task_reaping(tsk->pid).)
---
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -70,6 +70,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE	23	/* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
+#define MMF_REAPING		25	/* mm is undergoing reaping */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2996,6 +2996,23 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	set_bit(MMF_REAPING, &mm->flags);
+	if (unlikely(tsk_is_oom_victim(current))) {
+		/*
+		 * Wait for oom_reap_task() to stop working on this
+		 * mm. Because MMF_REAPING is already set before
+		 * calling down_read(), oom_reap_task() will not run
+		 * on this "mm" post up_write().
+		 *
+		 * tsk_is_oom_victim() cannot be set from under us
+		 * either because current->mm is already set to NULL
+		 * under task_lock before calling mmput and oom_mm is
+		 * set not NULL by the OOM killer only if current->mm
+		 * is found not NULL while holding the task_lock.
+		 */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
 	if (mm->locked_vm) {
 		vma = mm->mmap;
@@ -3018,26 +3035,9 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	set_bit(MMF_OOM_SKIP, &mm->flags);
-	if (unlikely(tsk_is_oom_victim(current))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * tsk_is_oom_victim() cannot be set from under us
-		 * either because current->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if current->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -485,30 +485,29 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+/*
+ * Returns 0 on success or
+ * -EAGAIN: mm->mmap_sem is contended
+ * -EINVAL: mm is ineligible
+ */
+static int __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
+	int ret = 0;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
+	 * and cause premature new oom victim selection: if MMF_REAPING is
+	 * not set under the protection of mmap_sem, allow reaping because
+	 * exit_mmap() has not been entered, which is serialized with a
+	 * down_write();up_write() cycle.  Otherwise, leave reaping to
+	 * exit_mmap(), which will set MMF_OOM_SKIP itself.
 	 */
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
+		ret = -EAGAIN;
 		trace_skip_task_reaping(tsk->pid);
 		goto unlock_oom;
 	}
@@ -529,13 +528,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * MMF_REAPING is set by exit_mmap when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_REAPING must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (test_bit(MMF_REAPING, &mm->flags)) {
 		up_read(&mm->mmap_sem);
+		ret = -EINVAL;
 		trace_skip_task_reaping(tsk->pid);
 		goto unlock_oom;
 	}
@@ -589,15 +589,18 @@ static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
+	int ret;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES) {
+		ret = __oom_reap_task_mm(tsk, mm);
+		if (ret != -EAGAIN)
+			break;
 		schedule_timeout_idle(HZ/10);
-
+	}
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();
@@ -609,7 +612,8 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (ret != -EINVAL)
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-06  9:00 Michal Hocko
From: Michal Hocko
To: David Rientjes
Cc: Tetsuo Handa, Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm

On Tue 05-12-17 23:48:21, David Rientjes wrote:
[...]
> I think this argues for doing MMF_REAPING-style behavior at the beginning
> of exit_mmap() and avoiding reaping altogether once we have reached that
> point.  There are no more users of the mm and we are in the process of
> tearing it down; I'm not sure that the oom reaper should be in the
> business of trying to interfere with that.  Or are there actual bug
> reports where an oom victim gets wedged while in exit_mmap() prior to
> releasing its memory?

Something like that does seem to work, indeed.  But we should better
understand what is going on here before adding new oom-reaper-specific
kludges.  So let's focus on getting more information from your crashes
first.
--
Michal Hocko
SUSE Labs
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-07  8:28 Michal Hocko
From: Michal Hocko
To: Tetsuo Handa
Cc: David Rientjes, Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm

On Thu 07-12-17 16:20:47, Tetsuo Handa wrote:
[...]
> int main(int argc, char *argv[])
> {
> 	int i;
> 	char *stack;
> 	if (fork() || fork() || setsid() == EOF || pipe(pipe_fd))
> 		_exit(0);
> 	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
> 		     MAP_ANONYMOUS | MAP_PRIVATE, EOF, 0);
> 	for (i = 0; i < NUMTHREADS; i++)
> 		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
> 			  /*CLONE_THREAD | CLONE_SIGHAND | */CLONE_VM | CLONE_FS |
> 			  CLONE_FILES, NULL) == -1)
> 			break;

Hmm, so you are creating a separate process (from the signal point of
view), and I suspect it is one of those that holds the last reference to
the mm_struct which is released here, and it has tsk_is_oom_victim() ==
false.

[...]
> [  113.273394] Freed by task 1377:
> [  113.276211]  kasan_slab_free+0x71/0xc0
> [  113.279093]  kmem_cache_free+0xaf/0x1e0
> [  113.281974]  remove_vma+0x9d/0xb0
> [  113.284734]  exit_mmap+0x179/0x250
> [  113.287651]  mmput+0x7d/0x1b0
> [  113.290456]  do_exit+0x408/0x1290
> [  113.293268]  do_group_exit+0x84/0x140
> [  113.296109]  get_signal+0x291/0x9b0
> [  113.298915]  do_signal+0x8e/0xa70
> [  113.301637]  exit_to_usermode_loop+0x71/0xb0
> [  113.304632]  do_syscall_64+0x343/0x390
> [  113.307349]  return_from_SYSCALL_64+0x0/0x75
[...]
> What we overlooked is the fact that "it is not always the process which
> got ->signal->oom_mm set, it is any thread which called mmput() which
> invoked the __mmput() path".  Therefore, the patch below fixes the oops
> in my case.  If some unrelated kernel thread was holding an mm_users
> reference, it is possible that we miss the down_write()/up_write()
> synchronization.
Very well spotted!  It could be any task in fact (e.g. somebody reading
from a /proc/<pid> file, which requires the mm_struct).

oom_reaper		oom_victim		task
						mmget_not_zero
			exit_mmap
						mmput
__oom_reap_task_mm				mmput
						__mmput
						  exit_mmap
						    remove_vma
  unmap_page_range

So we need a more robust test for the oom victim.  Your suggestion is
basically what I came up with originally [1] and which was deemed
ineffective because we took the mmap_sem even for regular paths, and
Kirill was afraid this adds some unnecessary cycles to the exit path,
which is quite hot.

So I guess we have to do something else instead.  We have to store the
oom flag in the mm struct as well.  Something like the patch below.

[1] http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@kernel.org
---
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 27cd36b762b5..b7668b5d3e14 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,6 +77,11 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
+static inline bool mm_is_oom_victim(struct mm_struct *mm)
+{
+	return test_bit(MMF_OOM_VICTIM, &mm->flags);
+}
+
 /*
  * Checks whether a page fault on the given mm is still reliable.
  * This is no longer true if the oom reaper started to reap the
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 9c8847395b5e..da673ca66e7a 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -68,8 +68,9 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_PAGE	23	/* mm has ever used the global huge zero page */
-#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
+#define MMF_OOM_VICTIM		23	/* mm is the oom victim */
+#define MMF_HUGE_ZERO_PAGE	24	/* mm has ever used the global huge zero page */
+#define MMF_DISABLE_THP		25	/* disable THP for all VMAs */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/mmap.c b/mm/mmap.c
index 476e810cf100..d00a06248ef1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3005,7 +3005,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 
 	set_bit(MMF_OOM_SKIP, &mm->flags);
-	if (unlikely(tsk_is_oom_victim(current))) {
+	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Wait for oom_reap_task() to stop working on this
 		 * mm. Because MMF_OOM_SKIP is already set before
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3b0d0fed8480..e4d290b6804b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -666,8 +666,10 @@ static void mark_oom_victim(struct task_struct *tsk)
 		return;
 
 	/* oom_mm is bound to the signal struct life time. */
-	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
 		mmgrab(tsk->signal->oom_mm);
+		set_bit(MMF_OOM_VICTIM, &mm->flags);
+	}
 
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
--
Michal Hocko
SUSE Labs
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-07  8:44 Michal Hocko
From: Michal Hocko
To: Tetsuo Handa
Cc: David Rientjes, Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm

On Thu 07-12-17 09:28:01, Michal Hocko wrote:
[...]
> oom_reaper		oom_victim		task
> 						mmget_not_zero
> 			exit_mmap

bleh, this should have been do_exit

> 						mmput
> __oom_reap_task_mm				mmput
> 						__mmput
> 						  exit_mmap
> 						    remove_vma
>   unmap_page_range
--
Michal Hocko
SUSE Labs
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-07 11:00 Tetsuo Handa
From: Tetsuo Handa
To: mhocko
Cc: rientjes, akpm, aarcange, linux-kernel, linux-mm

Michal Hocko wrote:
> Hmm, so you are creating a separate process (from the signal point of
> view), and I suspect it is one of those that holds the last reference to
> the mm_struct which is released here, and it has tsk_is_oom_victim() ==
> false.

Right.

> So we need a more robust test for the oom victim.  Your suggestion is
> basically what I came up with originally [1] and which was deemed
> ineffective because we took the mmap_sem even for regular paths, and
> Kirill was afraid this adds some unnecessary cycles to the exit path,
> which is quite hot.
>
> So I guess we have to do something else instead.  We have to store the
> oom flag in the mm struct as well.  Something like the patch below.

Yes, adding a new flag for this purpose will work.  Also, setting the
MMF_UNSTABLE flag after sending SIGKILL and before victim->mm becomes
NULL, and testing MMF_UNSTABLE at exit_mm(), should work.

But I prefer the simple revert + mmget()/mmput_async() approach at
http://lkml.kernel.org/r/201712062037.DAF90168.SVFQOJFMOOtHLF@I-love.SAKURA.ne.jp ,
for my approach not only saves lines but also fixes the unexpected change
for nommu at
http://lkml.kernel.org/r/201711091949.BDB73475.OSHFOMQtLFOFVJ@I-love.SAKURA.ne.jp .

Also, if we replace asynchronous OOM reaping by the OOM reaper kernel
thread with synchronous OOM reaping by the OOM killer, we can close the
MMF_OOM_SKIP race window, because it is then guaranteed that
__oom_reap_task_mm() is called before __mmput() is called.
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-07 21:22 David Rientjes
From: David Rientjes
To: Michal Hocko
Cc: Tetsuo Handa, Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, 7 Dec 2017, Michal Hocko wrote:

> Very well spotted!  It could be any task in fact (e.g. somebody reading
> from a /proc/<pid> file, which requires the mm_struct).
>
> oom_reaper		oom_victim		task
> 						mmget_not_zero
> 			exit_mmap
> 						mmput
> __oom_reap_task_mm				mmput
> 						__mmput
> 						  exit_mmap
> 						    remove_vma
>   unmap_page_range
>
> So we need a more robust test for the oom victim.  Your suggestion is
> basically what I came up with originally [1] and which was deemed
> ineffective because we took the mmap_sem even for regular paths, and
> Kirill was afraid this adds some unnecessary cycles to the exit path,
> which is quite hot.

Yes, I can confirm that in all crashes we have analyzed so far,
MMF_OOM_SKIP is actually set at the time that the oom_reaper causes BUGs
with various stack traces, all originating from unmap_page_range(), which
is certainly not supposed to happen.

> So I guess we have to do something else instead.  We have to store the
> oom flag in the mm struct as well.  Something like the patch below.
>
> [1] http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@kernel.org
> ---
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 27cd36b762b5..b7668b5d3e14 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -77,6 +77,11 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
>  	return tsk->signal->oom_mm;
>  }
>  
> +static inline bool mm_is_oom_victim(struct mm_struct *mm)
> +{
> +	return test_bit(MMF_OOM_VICTIM, &mm->flags);
> +}
> +
>  /*
>   * Checks whether a page fault on the given mm is still reliable.
>   * This is no longer true if the oom reaper started to reap the
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 9c8847395b5e..da673ca66e7a 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -68,8 +68,9 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
> -#define MMF_HUGE_ZERO_PAGE	23	/* mm has ever used the global huge zero page */
> -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> +#define MMF_OOM_VICTIM		23	/* mm is the oom victim */
> +#define MMF_HUGE_ZERO_PAGE	24	/* mm has ever used the global huge zero page */
> +#define MMF_DISABLE_THP		25	/* disable THP for all VMAs */
>  #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\

Could we not adjust the bit values, but simply add a new one for
MMF_OOM_VICTIM?  We have automated tools that look at specific bits in
mm->flags and it would be nice for them not to be inconsistent between
kernel versions.  Not absolutely required, but nice to avoid.

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 476e810cf100..d00a06248ef1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3005,7 +3005,7 @@ void exit_mmap(struct mm_struct *mm)
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
>  	set_bit(MMF_OOM_SKIP, &mm->flags);
> -	if (unlikely(tsk_is_oom_victim(current))) {
> +	if (unlikely(mm_is_oom_victim(mm))) {
>  		/*
>  		 * Wait for oom_reap_task() to stop working on this
>  		 * mm. Because MMF_OOM_SKIP is already set before
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3b0d0fed8480..e4d290b6804b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -666,8 +666,10 @@ static void mark_oom_victim(struct task_struct *tsk)
>  		return;
>  
>  	/* oom_mm is bound to the signal struct life time. */
> -	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
> +	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
>  		mmgrab(tsk->signal->oom_mm);
> +		set_bit(MMF_OOM_VICTIM, &mm->flags);
> +	}
>  
>  	/*
>  	 * Make sure that the task is woken up from uninterruptible sleep

Looks good.  I see the other email with the same functional change plus a
follow-up based on a suggestion by Tetsuo.  I'll test it alongside a
change to not adjust the existing MMF_* bit numbers.
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-08  7:50 Michal Hocko
From: Michal Hocko
To: David Rientjes
Cc: Tetsuo Handa, Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm

On Thu 07-12-17 13:22:30, David Rientjes wrote:
[...]
> > diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> > index 9c8847395b5e..da673ca66e7a 100644
> > --- a/include/linux/sched/coredump.h
> > +++ b/include/linux/sched/coredump.h
> > @@ -68,8 +68,9 @@ static inline int get_dumpable(struct mm_struct *mm)
> >  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
> >  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
> >  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
> > -#define MMF_HUGE_ZERO_PAGE	23	/* mm has ever used the global huge zero page */
> > -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> > +#define MMF_OOM_VICTIM		23	/* mm is the oom victim */
> > +#define MMF_HUGE_ZERO_PAGE	24	/* mm has ever used the global huge zero page */
> > +#define MMF_DISABLE_THP		25	/* disable THP for all VMAs */
> >  #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
> >  
> >  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
>
> Could we not adjust the bit values, but simply add a new one for
> MMF_OOM_VICTIM?  We have automated tools that look at specific bits in
> mm->flags and it would be nice for them not to be inconsistent between
> kernel versions.  Not absolutely required, but nice to avoid.

I just wanted to have those semantically related bits closer together.
But I do not insist on this.
--
Michal Hocko
SUSE Labs
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
  2017-12-06 11:37 Tetsuo Handa
From: Tetsuo Handa
To: rientjes
Cc: akpm, mhocko, aarcange, linux-kernel, linux-mm

David Rientjes wrote:
> On Wed, 6 Dec 2017, Tetsuo Handa wrote:
> > Also, I don't know what exit_mmap() is doing but I think that there is a
> > possibility that the OOM reaper tries to reclaim mlocked pages as soon as
> > exit_mmap() cleared the VM_LOCKED flag by calling munlock_vma_pages_all().
> >
> >	if (mm->locked_vm) {
> >		vma = mm->mmap;
> >		while (vma) {
> >			if (vma->vm_flags & VM_LOCKED)
> >				munlock_vma_pages_all(vma);
> >			vma = vma->vm_next;
> >		}
> >	}
> >
>
> Yes, that looks possible as well, although the problem I have reported can
> happen with or without mlock.  Did you find this by code inspection or
> have you experienced runtime problems with it?

By code inspection.

> I think this argues for doing MMF_REAPING-style behavior at the beginning
> of exit_mmap() and avoiding reaping altogether once we have reached that
> point.  There are no more users of the mm and we are in the process of
> tearing it down; I'm not sure that the oom reaper should be in the
> business of trying to interfere with that.  Or are there actual bug
> reports where an oom victim gets wedged while in exit_mmap() prior to
> releasing its memory?

If our assumption is that the OOM reaper can reclaim the majority of an
OOM victim's memory via the victim's ->signal->oom_mm, what would be wrong
with simply reverting commit 212925802454672e ("mm: oom: let oom_reap_task
and exit_mmap run concurrently") and replacing mmgrab()/mmdrop_async()
with mmget()/mmput_async(), so that the OOM reaper no longer needs to
worry about tricky __mmput() behavior (like shown below)?
---------- kernel/fork.c | 17 ++++++++++++----- mm/mmap.c | 17 ----------------- mm/oom_kill.c | 21 +++++++-------------- 3 files changed, 19 insertions(+), 36 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 432eadf..018a857 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -394,12 +394,18 @@ static inline void free_signal_struct(struct signal_struct *sig) { taskstats_tgid_free(sig); sched_autogroup_exit(sig); - /* - * __mmdrop is not safe to call from softirq context on x86 due to - * pgd_dtor so postpone it to the async context - */ - if (sig->oom_mm) + if (sig->oom_mm) { +#ifdef CONFIG_MMU + mmput_async(sig->oom_mm); +#else + /* + * There might be architectures where calling __mmdrop() from + * softirq context is not safe. Thus, postpone it to the async + * context. + */ mmdrop_async(sig->oom_mm); +#endif + } kmem_cache_free(signal_cachep, sig); } @@ -931,6 +937,7 @@ static inline void __mmput(struct mm_struct *mm) } if (mm->binfmt) module_put(mm->binfmt->module); + set_bit(MMF_OOM_SKIP, &mm->flags); mmdrop(mm); } diff --git a/mm/mmap.c b/mm/mmap.c index a4d5468..fafaf06 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3019,23 +3019,6 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - set_bit(MMF_OOM_SKIP, &mm->flags); - if (unlikely(tsk_is_oom_victim(current))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - * - * tsk_is_oom_victim() cannot be set from under us - * either because current->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if current->mm - * is found not NULL while holding the task_lock. 
- */ - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); - } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index c957be3..eb2a005 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -528,18 +528,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) goto unlock_oom; } - /* - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't - * work on the mm anymore. The check for MMF_OOM_SKIP must run - * under mmap_sem for reading because it serializes against the - * down_write();up_write() cycle in exit_mmap(). - */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { - up_read(&mm->mmap_sem); - trace_skip_task_reaping(tsk->pid); - goto unlock_oom; - } - trace_start_task_reaping(tsk->pid); /* @@ -683,8 +671,13 @@ static void mark_oom_victim(struct task_struct *tsk) return; /* oom_mm is bound to the signal struct life time. */ - if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) - mmgrab(tsk->signal->oom_mm); + if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) { +#ifdef CONFIG_MMU + mmget(mm); +#else + mmgrab(mm); +#endif + } /* * Make sure that the task is woken up from uninterruptible sleep ---------- If holding on to the address space while many victims' mm are waiting for the OOM reaper to reclaim them is considered dangerous, I think we can try direct OOM reaping (like the untested patch shown below) in order to reclaim on a first-blocked, first-reaped basis (and also to serve as a mitigation for the race caused by removing oom_lock from the OOM reaper). 
---------- include/linux/mm_types.h | 3 ++ include/linux/sched.h | 3 -- mm/oom_kill.c | 132 ++++++++++++++--------------------------------- 3 files changed, 43 insertions(+), 95 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index cfd0ac4..068119b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -501,6 +501,9 @@ struct mm_struct { atomic_long_t hugetlb_usage; #endif struct work_struct async_put_work; +#ifdef CONFIG_MMU + unsigned long oom_reap_started; +#endif #if IS_ENABLED(CONFIG_HMM) /* HMM needs to track a few things per mm */ diff --git a/include/linux/sched.h b/include/linux/sched.h index a2709d2..d63b599 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1081,9 +1081,6 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; -#ifdef CONFIG_MMU - struct task_struct *oom_reaper_list; -#endif #ifdef CONFIG_VMAP_STACK struct vm_struct *stack_vm_area; #endif diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 3b0d0fe..a9f8bae 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -309,9 +309,14 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } +#ifdef CONFIG_MMU +static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm); +#endif + static int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; + struct mm_struct *mm; unsigned long points; if (oom_unkillable_task(task, NULL, oc->nodemask)) @@ -324,7 +329,8 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) * any memory is quite low. 
*/ if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) { - if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) + mm = task->signal->oom_mm; + if (test_bit(MMF_OOM_SKIP, &mm->flags)) goto next; goto abort; } @@ -357,6 +363,15 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) if (oc->chosen) put_task_struct(oc->chosen); oc->chosen = (void *)-1UL; +#ifdef CONFIG_MMU + get_task_struct(task); + if (!is_memcg_oom(oc)) + rcu_read_unlock(); + oom_reap_task_mm(task, mm); + put_task_struct(task); + if (!is_memcg_oom(oc)) + rcu_read_lock(); +#endif return 1; } @@ -474,23 +489,14 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) return false; } - #ifdef CONFIG_MMU -/* - * OOM Reaper kernel thread which tries to reap the memory used by the OOM - * victim (if that is possible) to help the OOM killer to move on. - */ -static struct task_struct *oom_reaper_th; -static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); -static struct task_struct *oom_reaper_list; -static DEFINE_SPINLOCK(oom_reaper_lock); - static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { struct mmu_gather tlb; struct vm_area_struct *vma; bool ret = true; + trace_wake_reaper(tsk->pid); if (!down_read_trylock(&mm->mmap_sem)) { ret = false; trace_skip_task_reaping(tsk->pid); @@ -507,8 +513,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) * mmu_notifier_invalidate_range_{start,end} around unmap_page_range */ if (mm_has_notifiers(mm)) { + ret = false; up_read(&mm->mmap_sem); - schedule_timeout_idle(HZ); goto unlock_oom; } @@ -567,82 +573,22 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) return ret; } -#define MAX_OOM_REAP_RETRIES 10 -static void oom_reap_task(struct task_struct *tsk) -{ - int attempts = 0; - struct mm_struct *mm = tsk->signal->oom_mm; - - /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) - 
schedule_timeout_idle(HZ/10); - - if (attempts <= MAX_OOM_REAP_RETRIES) - goto done; - - - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", - task_pid_nr(tsk), tsk->comm); - debug_show_all_locks(); - -done: - tsk->oom_reaper_list = NULL; - - /* - * Hide this mm from OOM killer because it has been either reaped or - * somebody can't call up_write(mmap_sem). - */ - set_bit(MMF_OOM_SKIP, &mm->flags); - - /* Drop a reference taken by wake_oom_reaper */ - put_task_struct(tsk); -} - -static int oom_reaper(void *unused) +static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { - while (true) { - struct task_struct *tsk = NULL; - - wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); - spin_lock(&oom_reaper_lock); - if (oom_reaper_list != NULL) { - tsk = oom_reaper_list; - oom_reaper_list = tsk->oom_reaper_list; + if (__oom_reap_task_mm(tsk, mm)) + /* Hide this mm from OOM killer because we reaped it. */ + set_bit(MMF_OOM_SKIP, &mm->flags); + else if (!mm->oom_reap_started) + mm->oom_reap_started = jiffies; + else if (time_after(jiffies, mm->oom_reap_started + HZ)) { + if (!mm_has_notifiers(mm)) { + pr_info("oom_reaper: unable to reap pid:%d (%s)\n", + task_pid_nr(tsk), tsk->comm); + debug_show_all_locks(); } - spin_unlock(&oom_reaper_lock); - - if (tsk) - oom_reap_task(tsk); + /* Hide this mm from OOM killer because we can't reap. */ + set_bit(MMF_OOM_SKIP, &mm->flags); } - - return 0; -} - -static void wake_oom_reaper(struct task_struct *tsk) -{ - /* tsk is already queued? 
*/ - if (tsk == oom_reaper_list || tsk->oom_reaper_list) - return; - - get_task_struct(tsk); - - spin_lock(&oom_reaper_lock); - tsk->oom_reaper_list = oom_reaper_list; - oom_reaper_list = tsk; - spin_unlock(&oom_reaper_lock); - trace_wake_reaper(tsk->pid); - wake_up(&oom_reaper_wait); -} - -static int __init oom_init(void) -{ - oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); - return 0; -} -subsys_initcall(oom_init) -#else -static inline void wake_oom_reaper(struct task_struct *tsk) -{ } #endif /* CONFIG_MMU */ @@ -825,7 +771,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message) unsigned int victim_points = 0; static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); - bool can_oom_reap = true; /* * If the task is already exiting, don't alarm the sysadmin or kill @@ -835,7 +780,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message) task_lock(p); if (task_will_free_mem(p)) { mark_oom_victim(p); - wake_oom_reaper(p); task_unlock(p); put_task_struct(p); return; @@ -924,7 +868,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message) if (same_thread_group(p, victim)) continue; if (is_global_init(p)) { - can_oom_reap = false; set_bit(MMF_OOM_SKIP, &mm->flags); pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n", task_pid_nr(victim), victim->comm, @@ -941,9 +884,15 @@ static void oom_kill_process(struct oom_control *oc, const char *message) } rcu_read_unlock(); - if (can_oom_reap) - wake_oom_reaper(victim); - +#ifdef CONFIG_MMU + /* + * sysctl_oom_kill_allocating_task case could get stuck because + * select_bad_process() which calls oom_reap_task_mm() is not called. 
+ */ + if (sysctl_oom_kill_allocating_task && + !test_bit(MMF_OOM_SKIP, &mm->flags)) + oom_reap_task_mm(victim, mm); +#endif mmdrop(mm); put_task_struct(victim); } @@ -1019,7 +968,6 @@ bool out_of_memory(struct oom_control *oc) */ if (task_will_free_mem(current)) { mark_oom_victim(current); - wake_oom_reaper(current); return true; } ---------- ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap [not found] ` <201712060328.vB63SrDK069830@www262.sakura.ne.jp> 2017-12-06 7:48 ` David Rientjes @ 2017-12-06 8:50 ` Michal Hocko 1 sibling, 0 replies; 19+ messages in thread From: Michal Hocko @ 2017-12-06 8:50 UTC (permalink / raw) To: Tetsuo Handa Cc: David Rientjes, Andrew Morton, Andrea Arcangeli, linux-kernel, linux-mm On Wed 06-12-17 12:28:53, Tetsuo Handa wrote: > David Rientjes wrote: > > On Tue, 5 Dec 2017, David Rientjes wrote: > > > > > One way to solve the issue is to have two mm flags: one to indicate the mm > > > is entering unmap_vmas(): set the flag, do down_write(&mm->mmap_sem); > > > up_write(&mm->mmap_sem), then unmap_vmas(). The oom reaper needs this > > > flag clear, not MMF_OOM_SKIP, while holding down_read(&mm->mmap_sem) to be > > > allowed to call unmap_page_range(). The oom killer will still defer > > > selecting this victim for MMF_OOM_SKIP after unmap_vmas() returns. > > > > > > The result of that change would be that we do not oom reap from any mm > > > entering unmap_vmas(): we let unmap_vmas() do the work itself and avoid > > > racing with it. > > > > > > > I think we need something like the following? > > This patch does not work. __oom_reap_task_mm() can find MMF_REAPING and > return true and sets MMF_OOM_SKIP before exit_mmap() calls down_write(). > > Also, I don't know what exit_mmap() is doing but I think that there is a > possibility that the OOM reaper tries to reclaim mlocked pages as soon as > exit_mmap() cleared VM_LOCKED flag by calling munlock_vma_pages_all(). > > if (mm->locked_vm) { > vma = mm->mmap; > while (vma) { > if (vma->vm_flags & VM_LOCKED) > munlock_vma_pages_all(vma); > vma = vma->vm_next; > } > } I do not really see why this would matter. munlock_vma_pages_all is mostly about accounting and clearing the per-page state. 
It relies on follow_page which crawls page tables and unmap_page_range clears ptes under the lock which is taken when resolving a locked page as well. I still have to think about all the consequences when we are effectively reaping VM_LOCKED vmas - I suspect we can do some misaccounting but I do not yet see how this could lead to crashes. Maybe we can move VM_LOCKED clearing _after_ the munlock business is done but this is really hard to tell before I re-read the mlock code more thoroughly. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-06 2:43 Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap David Rientjes 2017-12-06 2:58 ` David Rientjes @ 2017-12-06 8:31 ` Michal Hocko 2017-12-07 11:35 ` Michal Hocko 2 siblings, 0 replies; 19+ messages in thread From: Michal Hocko @ 2017-12-06 8:31 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, linux-kernel, linux-mm On Tue 05-12-17 18:43:48, David Rientjes wrote: > Hi, > > I'd like to understand the synchronization between the oom_reaper's > unmap_page_range() and exit_mmap(). The latter does not hold > mm->mmap_sem: it's supposed to be the last thread operating on the mm > before it is destroyed. > > If unmap_page_range() races with unmap_vmas(), we trivially call > page_remove_rmap() twice on the same page: Well, the oom reaper is basically MADV_DONTNEED and that allows parallel teardown (it takes only mmap_sem for read). The exit path doesn't take the mmap_sem during unmap_vmas but that shouldn't make any difference because both paths would take it for read anyway. The essential synchronization between oom reaper and exit_mmap is exit_mmap: set_bit(MMF_OOM_SKIP, &mm->flags); if (unlikely(tsk_is_oom_victim(current))) { /* * Wait for oom_reap_task() to stop working on this * mm. Because MMF_OOM_SKIP is already set before * calling down_read(), oom_reap_task() will not run * on this "mm" post up_write(). * * tsk_is_oom_victim() cannot be set from under us * either because current->mm is already set to NULL * under task_lock before calling mmput and oom_mm is * set not NULL by the OOM killer only if current->mm * is found not NULL while holding the task_lock. */ down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); } oom_reaper if (!down_read_trylock(&mm->mmap_sem)) { /* * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't * work on the mm anymore. 
The check for MMF_OOM_SKIP must run * under mmap_sem for reading because it serializes against the * down_write();up_write() cycle in exit_mmap(). */ if (test_bit(MMF_OOM_SKIP, &mm->flags)) { which makes sure that the reaper doesn't race with free_pgtables. > BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 Hmm, this is really strange. Is this a pte without a pmd, or is the output just incomplete? > addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d > file: (null) fault: (null) mmap: (null) readpage: (null) > CPU: 2 PID: 1001 Comm: oom_reaper > Call Trace: > [<ffffffffa4bd967d>] dump_stack+0x4d/0x70 > [<ffffffffa4a03558>] unmap_page_range+0x1068/0x1130 could you use addr2line to get the exact spot where this triggered please? > [<ffffffffa4a2e07f>] __oom_reap_task_mm+0xd5/0x16b > [<ffffffffa4a2e226>] oom_reaper+0xff/0x14c > [<ffffffffa48d6ad1>] kthread+0xc1/0xe0 > > And there are more examples of badness from an unmap_page_range() racing > with unmap_vmas(). In this case, MMF_OOM_SKIP is doing two things: (1) > avoiding additional oom kills until unmap_vmas() returns and (2) avoid the > oom_reaper working on the mm after unmap_vmas(). In (2), there's nothing > preventing the oom reaper from calling unmap_page_range() in parallel with > the final thread doing unmap_vmas() -- we no longer do mmget() to prevent > exit_mmap() from being called. Yes, and that is intentional behavior. There shouldn't be any reason to exclude the two because this should be equivalent to calling MADV_DONTNEED in parallel. I will get to the rest of your email later because the above is the essential assumption 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") builds on. If it is not correct then we have a bigger problem. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-06 2:43 Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap David Rientjes 2017-12-06 2:58 ` David Rientjes 2017-12-06 8:31 ` Michal Hocko @ 2017-12-07 11:35 ` Michal Hocko 2017-12-07 15:44 ` Tetsuo Handa 2 siblings, 1 reply; 19+ messages in thread From: Michal Hocko @ 2017-12-07 11:35 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, linux-kernel, linux-mm David, could you test with this patch please? --- >From c4a2a58d3856bc1e528f7c1622a59cef58033cb3 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Thu, 7 Dec 2017 09:39:03 +0100 Subject: [PATCH] mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: [<ffffffffa4bd967d>] dump_stack+0x4d/0x70 [<ffffffffa4a03558>] unmap_page_range+0x1068/0x1130 [<ffffffffa4a2e07f>] __oom_reap_task_mm+0xd5/0x16b [<ffffffffa4a2e226>] oom_reaper+0xff/0x14c [<ffffffffa48d6ad1>] kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). 
The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Reported-by: David Rientjes <rientjes@google.com> Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Cc: stable # 4.14 Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/oom.h | 9 +++++++++ include/linux/sched/coredump.h | 5 +++-- mm/mmap.c | 2 +- mm/oom_kill.c | 4 +++- 4 files changed, 16 insertions(+), 4 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 27cd36b762b5..c4139e79771d 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -77,6 +77,15 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk) return tsk->signal->oom_mm; } +/* + * Use this helper if tsk->mm != mm and the victim mm needs a special + * handling. This is guaranteed to stay true after once set. + */ +static inline bool mm_is_oom_victim(struct mm_struct *mm) +{ + return test_bit(MMF_OOM_VICTIM, &mm->flags); +} + /* * Checks whether a page fault on the given mm is still reliable. 
* This is no longer true if the oom reaper started to reap the diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h index 9c8847395b5e..da673ca66e7a 100644 --- a/include/linux/sched/coredump.h +++ b/include/linux/sched/coredump.h @@ -68,8 +68,9 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ #define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */ #define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */ -#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */ -#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */ +#define MMF_OOM_VICTIM 23 /* mm is the oom victim */ +#define MMF_HUGE_ZERO_PAGE 24 /* mm has ever used the global huge zero page */ +#define MMF_DISABLE_THP 25 /* disable THP for all VMAs */ #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ diff --git a/mm/mmap.c b/mm/mmap.c index 476e810cf100..d00a06248ef1 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3005,7 +3005,7 @@ void exit_mmap(struct mm_struct *mm) unmap_vmas(&tlb, vma, 0, -1); set_bit(MMF_OOM_SKIP, &mm->flags); - if (unlikely(tsk_is_oom_victim(current))) { + if (unlikely(mm_is_oom_victim(mm))) { /* * Wait for oom_reap_task() to stop working on this * mm. Because MMF_OOM_SKIP is already set before diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 3b0d0fed8480..e4d290b6804b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -666,8 +666,10 @@ static void mark_oom_victim(struct task_struct *tsk) return; /* oom_mm is bound to the signal struct life time. */ - if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) + if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) { mmgrab(tsk->signal->oom_mm); + set_bit(MMF_OOM_VICTIM, &mm->flags); + } /* * Make sure that the task is woken up from uninterruptible sleep -- 2.15.0 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-07 11:35 ` Michal Hocko @ 2017-12-07 15:44 ` Tetsuo Handa 2017-12-07 16:30 ` Michal Hocko 0 siblings, 1 reply; 19+ messages in thread From: Tetsuo Handa @ 2017-12-07 15:44 UTC (permalink / raw) To: mhocko, rientjes; +Cc: akpm, aarcange, linux-kernel, linux-mm Michal Hocko wrote: > David, could you test with this patch please? Even if this patch solves David's case, you need to update the * tsk_is_oom_victim() cannot be set from under us * either because current->mm is already set to NULL * under task_lock before calling mmput and oom_mm is * set not NULL by the OOM killer only if current->mm * is found not NULL while holding the task_lock. part as well, for it is the explanation of why the tsk_is_oom_victim() test was expected to work. Also, do we need to do set_bit(MMF_OOM_SKIP, &mm->flags); if mm_is_oom_victim(mm) == false? That exit_mmap() is called means that nobody can reach this mm except ->signal->oom_mm, and mm_is_oom_victim(mm) == false means that this mm cannot be reached via ->signal->oom_mm. Then, I think we do not need to set MMF_OOM_SKIP on this mm at exit_mmap() if mm_is_oom_victim(mm) == false. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-07 15:44 ` Tetsuo Handa @ 2017-12-07 16:30 ` Michal Hocko 2017-12-07 21:55 ` David Rientjes 2017-12-08 10:11 ` Tetsuo Handa 0 siblings, 2 replies; 19+ messages in thread From: Michal Hocko @ 2017-12-07 16:30 UTC (permalink / raw) To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, linux-kernel, linux-mm On Fri 08-12-17 00:44:11, Tetsuo Handa wrote: > Michal Hocko wrote: > > David, could you test with this patch please? > > Even if this patch solved David's case, you need to update > > * tsk_is_oom_victim() cannot be set from under us > * either because current->mm is already set to NULL > * under task_lock before calling mmput and oom_mm is > * set not NULL by the OOM killer only if current->mm > * is found not NULL while holding the task_lock. > > part as well, for it is the explanation of why > tsk_is_oom_victim() test was expected to work. Yes, the same applies to mm_is_oom_victim. I will fix up s@tsk_@mm_@ here. > Also, do we need to do > > set_bit(MMF_OOM_SKIP, &mm->flags); > > if mm_is_oom_victim(mm) == false? I do not think we really need to set MMF_OOM_SKIP if we are not going to synchronize. > exit_mmap() is called means that nobody can reach this mm > except ->signal->oom_mm, and mm_is_oom_victim(mm) == false > means that this mm cannot be reached by ->signal->oom_mm . > > Then, I think we do not need to set MMF_OOM_SKIP on this mm > at exit_mmap() if mm_is_oom_victim(mm) == false. Yes. I will fold the following in if this turns out to really address David's issue. But I suspect this will be the case considering the NULL pmd in the report, which would suggest racing with free_pgtables... Thanks for the review! 
--- diff --git a/mm/mmap.c b/mm/mmap.c index d00a06248ef1..e63b7a576670 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3004,7 +3004,6 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - set_bit(MMF_OOM_SKIP, &mm->flags); if (unlikely(mm_is_oom_victim(mm))) { /* * Wait for oom_reap_task() to stop working on this @@ -3012,12 +3011,13 @@ void exit_mmap(struct mm_struct *mm) * calling down_read(), oom_reap_task() will not run * on this "mm" post up_write(). * - * tsk_is_oom_victim() cannot be set from under us + * mm_is_oom_victim() cannot be set from under us * either because current->mm is already set to NULL * under task_lock before calling mmput and oom_mm is * set not NULL by the OOM killer only if current->mm * is found not NULL while holding the task_lock. */ + set_bit(MMF_OOM_SKIP, &mm->flags); down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-07 16:30 ` Michal Hocko @ 2017-12-07 21:55 ` David Rientjes 2017-12-08 9:26 ` David Rientjes 2017-12-08 10:11 ` Tetsuo Handa 1 sibling, 1 reply; 19+ messages in thread From: David Rientjes @ 2017-12-07 21:55 UTC (permalink / raw) To: Michal Hocko; +Cc: Tetsuo Handa, akpm, aarcange, linux-kernel, linux-mm On Thu, 7 Dec 2017, Michal Hocko wrote: > yes. I will fold the following in if this turned out to really address > David's issue. But I suspect this will be the case considering the NULL > pmd in the report which would suggest racing with free_pgtable... > I'm backporting and testing the following patch against Linus's tree. To clarify an earlier point, we don't actually have any change from upstream code that allows for free_pgtables() before the set_bit(MMF_OOM_SKIP);down_write();up_write() cycle. diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -66,6 +66,15 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk) return tsk->signal->oom_mm; } +/* + * Use this helper if tsk->mm != mm and the victim mm needs a special + * handling. This is guaranteed to stay true after once set. + */ +static inline bool mm_is_oom_victim(struct mm_struct *mm) +{ + return test_bit(MMF_OOM_VICTIM, &mm->flags); +} + /* * Checks whether a page fault on the given mm is still reliable. 
* This is no longer true if the oom reaper started to reap the diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h --- a/include/linux/sched/coredump.h +++ b/include/linux/sched/coredump.h @@ -71,6 +71,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */ #define MMF_DISABLE_THP 24 /* disable THP for all VMAs */ #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) +#define MMF_OOM_VICTIM 25 /* mm is the oom victim */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ MMF_DISABLE_THP_MASK) diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3019,20 +3019,20 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - set_bit(MMF_OOM_SKIP, &mm->flags); - if (unlikely(tsk_is_oom_victim(current))) { + if (unlikely(mm_is_oom_victim(mm))) { /* * Wait for oom_reap_task() to stop working on this * mm. Because MMF_OOM_SKIP is already set before * calling down_read(), oom_reap_task() will not run * on this "mm" post up_write(). * - * tsk_is_oom_victim() cannot be set from under us + * mm_is_oom_victim() cannot be set from under us * either because current->mm is already set to NULL * under task_lock before calling mmput and oom_mm is * set not NULL by the OOM killer only if current->mm * is found not NULL while holding the task_lock. */ + set_bit(MMF_OOM_SKIP, &mm->flags); down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -683,8 +683,10 @@ static void mark_oom_victim(struct task_struct *tsk) return; /* oom_mm is bound to the signal struct life time. 
*/ - if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) + if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) { mmgrab(tsk->signal->oom_mm); + set_bit(MMF_OOM_VICTIM, &mm->flags); + } /* * Make sure that the task is woken up from uninterruptible sleep ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-07 21:55 ` David Rientjes @ 2017-12-08 9:26 ` David Rientjes 2017-12-08 11:27 ` Michal Hocko 0 siblings, 1 reply; 19+ messages in thread From: David Rientjes @ 2017-12-08 9:26 UTC (permalink / raw) To: Michal Hocko; +Cc: Tetsuo Handa, akpm, aarcange, linux-kernel, linux-mm On Thu, 7 Dec 2017, David Rientjes wrote: > I'm backporting and testing the following patch against Linus's tree. To > clarify an earlier point, we don't actually have any change from upstream > code that allows for free_pgtables() before the > set_bit(MMF_OOM_SKIP);down_write();up_write() cycle. > > diff --git a/include/linux/oom.h b/include/linux/oom.h > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -66,6 +66,15 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk) > return tsk->signal->oom_mm; > } > > +/* > + * Use this helper if tsk->mm != mm and the victim mm needs a special > + * handling. This is guaranteed to stay true after once set. > + */ > +static inline bool mm_is_oom_victim(struct mm_struct *mm) > +{ > + return test_bit(MMF_OOM_VICTIM, &mm->flags); > +} > + > /* > * Checks whether a page fault on the given mm is still reliable. 
> * This is no longer true if the oom reaper started to reap the > diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h > --- a/include/linux/sched/coredump.h > +++ b/include/linux/sched/coredump.h > @@ -71,6 +71,7 @@ static inline int get_dumpable(struct mm_struct *mm) > #define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */ > #define MMF_DISABLE_THP 24 /* disable THP for all VMAs */ > #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) > +#define MMF_OOM_VICTIM 25 /* mm is the oom victim */ > > #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ > MMF_DISABLE_THP_MASK) > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3019,20 +3019,20 @@ void exit_mmap(struct mm_struct *mm) > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > > - set_bit(MMF_OOM_SKIP, &mm->flags); > - if (unlikely(tsk_is_oom_victim(current))) { > + if (unlikely(mm_is_oom_victim(mm))) { > /* > * Wait for oom_reap_task() to stop working on this > * mm. Because MMF_OOM_SKIP is already set before > * calling down_read(), oom_reap_task() will not run > * on this "mm" post up_write(). > * > - * tsk_is_oom_victim() cannot be set from under us > + * mm_is_oom_victim() cannot be set from under us > * either because current->mm is already set to NULL > * under task_lock before calling mmput and oom_mm is > * set not NULL by the OOM killer only if current->mm > * is found not NULL while holding the task_lock. > */ > + set_bit(MMF_OOM_SKIP, &mm->flags); > down_write(&mm->mmap_sem); > up_write(&mm->mmap_sem); > } > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -683,8 +683,10 @@ static void mark_oom_victim(struct task_struct *tsk) > return; > > /* oom_mm is bound to the signal struct life time. 
*/ > - if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) > + if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) { > mmgrab(tsk->signal->oom_mm); > + set_bit(MMF_OOM_VICTIM, &mm->flags); > + } > > /* > * Make sure that the task is woken up from uninterruptible sleep > This passes all functional testing that I have and I can create a synthetic testcase that can trigger at least MMF_OOM_VICTIM getting set while oom_reaper is still working on an mm that this prevents, so feel free to add an Acked-by: David Rientjes <rientjes@google.com> with a variant of your previous changelogs. Thanks! I think it would be appropriate to cc stable for 4.14 and add a Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") if nobody disagrees, which I think you may have already done on a previous iteration. We can still discuss if there are any VM_LOCKED subtleties in this thread, but I have no evidence that it is responsible for any issues. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-08 9:26 ` David Rientjes @ 2017-12-08 11:27 ` Michal Hocko 0 siblings, 0 replies; 19+ messages in thread From: Michal Hocko @ 2017-12-08 11:27 UTC (permalink / raw) To: David Rientjes; +Cc: Tetsuo Handa, akpm, aarcange, linux-kernel, linux-mm On Fri 08-12-17 01:26:46, David Rientjes wrote: > On Thu, 7 Dec 2017, David Rientjes wrote: > > > I'm backporting and testing the following patch against Linus's tree. To > > clarify an earlier point, we don't actually have any change from upstream > > code that allows for free_pgtables() before the > > set_bit(MMF_OOM_SKIP);down_write();up_write() cycle. > > > > diff --git a/include/linux/oom.h b/include/linux/oom.h > > --- a/include/linux/oom.h > > +++ b/include/linux/oom.h > > @@ -66,6 +66,15 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk) > > return tsk->signal->oom_mm; > > } > > > > +/* > > + * Use this helper if tsk->mm != mm and the victim mm needs a special > > + * handling. This is guaranteed to stay true after once set. > > + */ > > +static inline bool mm_is_oom_victim(struct mm_struct *mm) > > +{ > > + return test_bit(MMF_OOM_VICTIM, &mm->flags); > > +} > > + > > /* > > * Checks whether a page fault on the given mm is still reliable. 
> > * This is no longer true if the oom reaper started to reap the > > diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h > > --- a/include/linux/sched/coredump.h > > +++ b/include/linux/sched/coredump.h > > @@ -71,6 +71,7 @@ static inline int get_dumpable(struct mm_struct *mm) > > #define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */ > > #define MMF_DISABLE_THP 24 /* disable THP for all VMAs */ > > #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) > > +#define MMF_OOM_VICTIM 25 /* mm is the oom victim */ > > > > #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ > > MMF_DISABLE_THP_MASK) > > diff --git a/mm/mmap.c b/mm/mmap.c > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3019,20 +3019,20 @@ void exit_mmap(struct mm_struct *mm) > > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > > unmap_vmas(&tlb, vma, 0, -1); > > > > - set_bit(MMF_OOM_SKIP, &mm->flags); > > - if (unlikely(tsk_is_oom_victim(current))) { > > + if (unlikely(mm_is_oom_victim(mm))) { > > /* > > * Wait for oom_reap_task() to stop working on this > > * mm. Because MMF_OOM_SKIP is already set before > > * calling down_read(), oom_reap_task() will not run > > * on this "mm" post up_write(). > > * > > - * tsk_is_oom_victim() cannot be set from under us > > + * mm_is_oom_victim() cannot be set from under us > > * either because current->mm is already set to NULL > > * under task_lock before calling mmput and oom_mm is > > * set not NULL by the OOM killer only if current->mm > > * is found not NULL while holding the task_lock. > > */ > > + set_bit(MMF_OOM_SKIP, &mm->flags); > > down_write(&mm->mmap_sem); > > up_write(&mm->mmap_sem); > > } > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -683,8 +683,10 @@ static void mark_oom_victim(struct task_struct *tsk) > > return; > > > > /* oom_mm is bound to the signal struct life time. 
*/ > > - if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) > > + if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) { > > mmgrab(tsk->signal->oom_mm); > > + set_bit(MMF_OOM_VICTIM, &mm->flags); > > + } > > > > /* > > * Make sure that the task is woken up from uninterruptible sleep > > > > This passes all functional testing that I have and I can create a > synthetic testcase that can trigger at least MMF_OOM_VICTIM getting set > while oom_reaper is still working on an mm that this prevents, so feel > free to add an > > Acked-by: David Rientjes <rientjes@google.com> > > with a variant of your previous changelogs. Thanks! > > I think it would be appropriate to cc stable for 4.14 and add a > > Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run > concurrently") > > if nobody disagrees, which I think you may have already done on a previous > iteration. Thanks for your testing! I will repost the patch later today. > We can still discuss if there are any VM_LOCKED subtleties in this > thread, but I have no evidence that it is responsible for any issues. Yes this is worth a separate discussion. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap 2017-12-07 16:30 ` Michal Hocko 2017-12-07 21:55 ` David Rientjes @ 2017-12-08 10:11 ` Tetsuo Handa 1 sibling, 0 replies; 19+ messages in thread From: Tetsuo Handa @ 2017-12-08 10:11 UTC (permalink / raw) To: mhocko; +Cc: rientjes, akpm, aarcange, linux-kernel, linux-mm Michal Hocko wrote: > On Fri 08-12-17 00:44:11, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > David, could you test with this patch please? > > > > Even if this patch solved David's case, you need to update > > > > * tsk_is_oom_victim() cannot be set from under us > > * either because current->mm is already set to NULL > > * under task_lock before calling mmput and oom_mm is > > * set not NULL by the OOM killer only if current->mm > > * is found not NULL while holding the task_lock. > > > > part as well, for it is the explanation of why > > tsk_is_oom_victim() test was expected to work. > > Yes, the same applies for mm_is_oom_victim. I will fixup s@tsk_@mm_@ > here. > If you try to "s@tsk_@mm_@", I suggest doing ---------- mm/mmap.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 68fdd14..b2cb4e5 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3025,12 +3025,6 @@ void exit_mmap(struct mm_struct *mm) * mm. Because MMF_OOM_SKIP is already set before * calling down_read(), oom_reap_task() will not run * on this "mm" post up_write(). - * - * mm_is_oom_victim() cannot be set from under us - * either because current->mm is already set to NULL - * under task_lock before calling mmput and oom_mm is - * set not NULL by the OOM killer only if current->mm - * is found not NULL while holding the task_lock. */ set_bit(MMF_OOM_SKIP, &mm->flags); down_write(&mm->mmap_sem); ---------- because current->mm and current->signal->oom_mm are irrelevant for the reason to set MMF_OOM_SKIP here. What matters here is that this mm's ->mmap must not be accessed by the OOM reaper afterwards. 
(And I think we can do mm->mmap = NULL instead of setting MMF_OOM_SKIP. Then, we can defer setting MMF_OOM_SKIP until "the OOM reaper gave up waiting for __mmput()" or "__mmput() became ready to call mmdrop()" whichever occurred first.) I think we can try changes shown below for (1) Allowing __oom_reap_task_mm() to run without oom_lock held. (Because starting OOM reaping without waiting for oom_lock can mitigate unexpected stalls due to schedule_timeout_killable(1) with oom_lock held. Though, the root cause of the stall is that we can't guarantee that enough CPU resource is given to a thread holding oom_lock.) Also, we could offload __oom_reap_task_mm() to some WQ_MEM_RECLAIM workqueue in order to allow reclaiming memory in parallel when many victims are waiting for OOM reaping. (2) Guarantee that last second allocation at __alloc_pages_may_oom() is attempted after confirming that there is no victim's mm without MMF_OOM_SKIP set. (Because moving last second allocation to after select_bad_process() was rejected. Though, doing last second allocation attempt after select_bad_process() can allow us to eliminate oom_lock taken in the changes shown below...) (3) Guarantee that MMF_OOM_SKIP is set at __mmput(). (Because this was an unexpected change for CONFIG_MMU=n.) ---------- kernel/fork.c | 6 ++++++ mm/mmap.c | 9 ++------- mm/oom_kill.c | 49 ++++++++++--------------------------------------- 3 files changed, 18 insertions(+), 46 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 432eadf..dd1d69e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -931,6 +931,12 @@ static inline void __mmput(struct mm_struct *mm) } if (mm->binfmt) module_put(mm->binfmt->module); + if (unlikely(mm_is_oom_victim(mm))) { + /* Tell select_bad_process() to start selecting next mm. 
*/ + mutex_lock(&oom_lock); + set_bit(MMF_OOM_SKIP, &mm->flags); + mutex_unlock(&oom_lock); + } mmdrop(mm); } diff --git a/mm/mmap.c b/mm/mmap.c index b2cb4e5..641b5c1 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3020,14 +3020,9 @@ void exit_mmap(struct mm_struct *mm) unmap_vmas(&tlb, vma, 0, -1); if (unlikely(mm_is_oom_victim(mm))) { - /* - * Wait for oom_reap_task() to stop working on this - * mm. Because MMF_OOM_SKIP is already set before - * calling down_read(), oom_reap_task() will not run - * on this "mm" post up_write(). - */ - set_bit(MMF_OOM_SKIP, &mm->flags); + /* Tell oom_reap_task() to stop working on this mm. */ down_write(&mm->mmap_sem); + mm->mmap = NULL; up_write(&mm->mmap_sem); } free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 29f8555..ec5303d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -489,28 +489,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { struct mmu_gather tlb; struct vm_area_struct *vma; - bool ret = true; - - /* - * We have to make sure to not race with the victim exit path - * and cause premature new oom victim selection: - * __oom_reap_task_mm exit_mm - * mmget_not_zero - * mmput - * atomic_dec_and_test - * exit_oom_victim - * [...] - * out_of_memory - * select_bad_process - * # no TIF_MEMDIE task selects new victim - * unmap_page_range # frees some memory - */ - mutex_lock(&oom_lock); if (!down_read_trylock(&mm->mmap_sem)) { - ret = false; trace_skip_task_reaping(tsk->pid); - goto unlock_oom; + return false; } /* @@ -525,19 +507,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) if (mm_has_notifiers(mm)) { up_read(&mm->mmap_sem); schedule_timeout_idle(HZ); - goto unlock_oom; - } - - /* - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't - * work on the mm anymore. 
The check for MMF_OOM_SKIP must run - * under mmap_sem for reading because it serializes against the - * down_write();up_write() cycle in exit_mmap(). - */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { - up_read(&mm->mmap_sem); - trace_skip_task_reaping(tsk->pid); - goto unlock_oom; + return true; } trace_start_task_reaping(tsk->pid); @@ -550,6 +520,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) */ set_bit(MMF_UNSTABLE, &mm->flags); + /* + * exit_mmap() sets mm->mmap to NULL with mm->mmap_sem held for write + * if I need to stop working on this mm. + */ for (vma = mm->mmap ; vma; vma = vma->vm_next) { if (!can_madv_dontneed_vma(vma)) continue; @@ -579,9 +553,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) up_read(&mm->mmap_sem); trace_finish_task_reaping(tsk->pid); -unlock_oom: - mutex_unlock(&oom_lock); - return ret; + return true; } #define MAX_OOM_REAP_RETRIES 10 @@ -605,11 +577,10 @@ static void oom_reap_task(struct task_struct *tsk) done: tsk->oom_reaper_list = NULL; - /* - * Hide this mm from OOM killer because it has been either reaped or - * somebody can't call up_write(mmap_sem). - */ + /* Tell select_bad_process() to start selecting next mm. */ + mutex_lock(&oom_lock); set_bit(MMF_OOM_SKIP, &mm->flags); + mutex_unlock(&oom_lock); /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); ---------- ^ permalink raw reply related [flat|nested] 19+ messages in thread
end of thread, other threads:[~2017-12-08 11:27 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-12-06 2:43 Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap David Rientjes 2017-12-06 2:58 ` David Rientjes [not found] ` <201712060328.vB63SrDK069830@www262.sakura.ne.jp> 2017-12-06 7:48 ` David Rientjes 2017-12-06 9:00 ` Michal Hocko [not found] ` <201712070720.vB77KlBQ009754@www262.sakura.ne.jp> 2017-12-07 8:28 ` Michal Hocko 2017-12-07 8:44 ` Michal Hocko 2017-12-07 11:00 ` Tetsuo Handa 2017-12-07 21:22 ` David Rientjes 2017-12-08 7:50 ` Michal Hocko 2017-12-06 11:37 ` Tetsuo Handa 2017-12-06 8:50 ` Michal Hocko 2017-12-06 8:31 ` Michal Hocko 2017-12-07 11:35 ` Michal Hocko 2017-12-07 15:44 ` Tetsuo Handa 2017-12-07 16:30 ` Michal Hocko 2017-12-07 21:55 ` David Rientjes 2017-12-08 9:26 ` David Rientjes 2017-12-08 11:27 ` Michal Hocko 2017-12-08 10:11 ` Tetsuo Handa