All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch] mm, oom: fix concurrent munlock and oom reaper unmap
@ 2018-04-17 22:46 David Rientjes
  2018-04-18  0:57 ` Tetsuo Handa
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-17 22:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Andrea Arcangeli, Tetsuo Handa, Roman Gushchin,
	linux-kernel, linux-mm

Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.

This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires kick_all_cpus_sync().  If the pmd is zapped
by the oom reaper during follow_page_mask() after the check for pmd_none()
is bypassed, this ends up deferencing a NULL ptl.

Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
reaped.  This prevents the concurrent munlock_vma_pages_range() and
unmap_page_range().  The oom reaper will simply not operate on an mm that
has the bit set and leave the unmapping to exit_mmap().

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable@vger.kernel.org [4.14+]
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/mmap.c     | 38 ++++++++++++++++++++------------------
 mm/oom_kill.c | 19 ++++++++-----------
 2 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Wait for oom_reap_task() to stop working on this mm.  Because
+		 * MMF_UNSTABLE is already set before calling down_read(),
+		 * oom_reap_task() will not run on this mm after up_write().
+		 * oom_reap_task() also depends on a stable VM_LOCKED flag to
+		 * indicate it should not unmap during munlock_vma_pages_all().
+		 *
+		 * mm_is_oom_victim() cannot be set from under us because
+		 * victim->mm is already set to NULL under task_lock before
+		 * calling mmput() and victim->signal->oom_mm is set by the oom
+		 * killer only if victim->mm is non-NULL while holding
+		 * task_lock().
+		 */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over reaped memory.
+	 *
+	 * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't
+	 * work on the mm anymore. The check for MMF_OOM_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
 		goto unlock_oom;
@@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-17 22:46 [patch] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
@ 2018-04-18  0:57 ` Tetsuo Handa
  2018-04-18  2:39   ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-18  0:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Tetsuo Handa,
	Roman Gushchin, linux-kernel, linux-mm

David Rientjes wrote:
> Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> possible for the oom reaper to concurrently operate on an mm until
> MMF_OOM_SKIP is set.
> 
> This allows munlock_vma_pages_all() to concurrently run while the oom
> reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> clearing VM_LOCKED from vm_flags before actually doing the munlock to
> determine if any other vmas are locking the same memory, the check for
> VM_LOCKED in the oom reaper is racy.
> 
> This is especially noticeable on architectures such as powerpc where
> clearing a huge pmd requires kick_all_cpus_sync().  If the pmd is zapped
> by the oom reaper during follow_page_mask() after the check for pmd_none()
> is bypassed, this ends up deferencing a NULL ptl.

I don't know whether the explanation above is correct.
Did you actually see a crash caused by this race?

> 
> Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> reaped.  This prevents the concurrent munlock_vma_pages_range() and
> unmap_page_range().  The oom reaper will simply not operate on an mm that
> has the bit set and leave the unmapping to exit_mmap().

But this patch is setting MMF_OOM_SKIP without reaping any memory as soon as
MMF_UNSTABLE is set, which is the situation described in 212925802454:

    At the same time if the OOM reaper doesn't wait at all for the memory of
    the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
    generate a spurious OOM kill.

If exit_mmap() does not wait for any pages and __oom_reap_task_mm() can not
handle mlock()ed pages, isn't it better to revert 212925802454 like I mentioned
at https://patchwork.kernel.org/patch/10095661/ and let the OOM reaper reclaim
as much as possible before setting MMF_OOM_SKIP?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  0:57 ` Tetsuo Handa
@ 2018-04-18  2:39   ` David Rientjes
  2018-04-18  2:52     ` [patch v2] " David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-18  2:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Wed, 18 Apr 2018, Tetsuo Handa wrote:

> > Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> > possible for the oom reaper to concurrently operate on an mm until
> > MMF_OOM_SKIP is set.
> > 
> > This allows munlock_vma_pages_all() to concurrently run while the oom
> > reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> > clearing VM_LOCKED from vm_flags before actually doing the munlock to
> > determine if any other vmas are locking the same memory, the check for
> > VM_LOCKED in the oom reaper is racy.
> > 
> > This is especially noticeable on architectures such as powerpc where
> > clearing a huge pmd requires kick_all_cpus_sync().  If the pmd is zapped
> > by the oom reaper during follow_page_mask() after the check for pmd_none()
> > is bypassed, this ends up deferencing a NULL ptl.
> 
> I don't know whether the explanation above is correct.
> Did you actually see a crash caused by this race?
> 

Yes, it's trivially reproducible on power by simply mlocking a ton of 
memory and triggering oom kill.

> > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> > reaped.  This prevents the concurrent munlock_vma_pages_range() and
> > unmap_page_range().  The oom reaper will simply not operate on an mm that
> > has the bit set and leave the unmapping to exit_mmap().
> 
> But this patch is setting MMF_OOM_SKIP without reaping any memory as soon as
> MMF_UNSTABLE is set, which is the situation described in 212925802454:
> 

Oh, you're referring to __oom_reap_task_mm() returning true because of 
MMF_UNSTABLE and then setting MMF_OOM_SKIP itself?  Yes, that is dumb.  We 
could change __oom_reap_task_mm() to only set MMF_OOM_SKIP if MMF_UNSTABLE 
hasn't been set.  I'll send a v2, which I needed to do anyway to do 
s/kick_all_cpus_sync/serialize_against_pte_lookup/ in the changelog (power 
only does it for the needed cpus).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  2:39   ` David Rientjes
@ 2018-04-18  2:52     ` David Rientjes
  2018-04-18  3:55       ` Tetsuo Handa
  2018-04-18  7:50       ` Michal Hocko
  0 siblings, 2 replies; 47+ messages in thread
From: David Rientjes @ 2018-04-18  2:52 UTC (permalink / raw)
  To: Andrew Morton, Tetsuo Handa
  Cc: Michal Hocko, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm

Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.

This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up deferencing a NULL ptl.

Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
reaped.  This prevents the concurrent munlock_vma_pages_range() and
unmap_page_range().  The oom reaper will simply not operate on an mm that
has the bit set and leave the unmapping to exit_mmap().

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable@vger.kernel.org [4.14+]
Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2:
  - oom reaper only sets MMF_OOM_SKIP if MMF_UNSTABLE was never set (either
    by itself or by exit_mmap(), per Tetsuo
  - s/kick_all_cpus_sync/serialize_against_pte_lookup/ in changelog as more
    isolated way of forcing cpus as non-idle on power

 mm/mmap.c     | 38 ++++++++++++++++++++------------------
 mm/oom_kill.c | 28 +++++++++++++---------------
 2 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Wait for oom_reap_task() to stop working on this mm.  Because
+		 * MMF_UNSTABLE is already set before calling down_read(),
+		 * oom_reap_task() will not run on this mm after up_write().
+		 * oom_reap_task() also depends on a stable VM_LOCKED flag to
+		 * indicate it should not unmap during munlock_vma_pages_all().
+		 *
+		 * mm_is_oom_victim() cannot be set from under us because
+		 * victim->mm is already set to NULL under task_lock before
+		 * calling mmput() and victim->signal->oom_mm is set by the oom
+		 * killer only if victim->mm is non-NULL while holding
+		 * task_lock().
+		 */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over reaped memory.
+	 *
+	 * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't
+	 * work on the mm anymore. The check for MMF_OOM_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
 		goto unlock_oom;
@@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -567,6 +564,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -594,7 +592,6 @@ static void oom_reap_task(struct task_struct *tsk)
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();
@@ -603,10 +600,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	tsk->oom_reaper_list = NULL;
 
 	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
+	 * If the oom reaper could not get started on this mm and it has not yet
+	 * reached exit_mmap(), set MMF_OOM_SKIP to disregard.
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (!test_bit(MMF_UNSTABLE, &mm->flags))
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  2:52     ` [patch v2] " David Rientjes
@ 2018-04-18  3:55       ` Tetsuo Handa
  2018-04-18  4:11         ` David Rientjes
  2018-04-18  7:50       ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-18  3:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

David Rientjes wrote:
> Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> reaped.  This prevents the concurrent munlock_vma_pages_range() and
> unmap_page_range().  The oom reaper will simply not operate on an mm that
> has the bit set and leave the unmapping to exit_mmap().

This change assumes that munlock_vma_pages_all()/unmap_vmas()/free_pgtables()
are never blocked for memory allocation. Is that guaranteed? For example,
i_mmap_lock_write() from unmap_single_vma() from unmap_vmas() is never blocked
for memory allocation? Commit 97b1255cb27c551d ("mm,oom_reaper: check for
MMF_OOM_SKIP before complaining") was waiting for i_mmap_lock_write() from
unlink_file_vma() from free_pgtables(). Is it really guaranteed that somebody
else who is holding that lock is never waiting for memory allocation?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  3:55       ` Tetsuo Handa
@ 2018-04-18  4:11         ` David Rientjes
  2018-04-18  4:47           ` Tetsuo Handa
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-18  4:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Wed, 18 Apr 2018, Tetsuo Handa wrote:

> > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> > reaped.  This prevents the concurrent munlock_vma_pages_range() and
> > unmap_page_range().  The oom reaper will simply not operate on an mm that
> > has the bit set and leave the unmapping to exit_mmap().
> 
> This change assumes that munlock_vma_pages_all()/unmap_vmas()/free_pgtables()
> are never blocked for memory allocation. Is that guaranteed? For example,
> i_mmap_lock_write() from unmap_single_vma() from unmap_vmas() is never blocked
> for memory allocation? Commit 97b1255cb27c551d ("mm,oom_reaper: check for
> MMF_OOM_SKIP before complaining") was waiting for i_mmap_lock_write() from
> unlink_file_vma() from free_pgtables(). Is it really guaranteed that somebody
> else who is holding that lock is never waiting for memory allocation?
> 

Commit 97b1255cb27c is referencing MMF_OOM_SKIP already being set by 
exit_mmap().  The only thing this patch changes is where that is done: 
before or after free_pgtables().  We can certainly move it to before 
free_pgtables() at the risk of subsequent (and eventually unnecessary) oom 
kills.  It's not exactly the point of this patch.

I have thousands of real-world examples where additional processes were 
oom killed while the original victim was in free_pgtables().  That's why 
we've moved the MMF_OOM_SKIP to after free_pgtables().  I'm not sure how 
likely your scenario is in the real world, but if it poses a problem then 
I believe it should be fixed by eventually deferring previous victims as a 
change to oom_evaluate_task(), not exit_mmap().  If you'd like me to fix 
that, please send along your test case that triggers it and I will send a 
patch.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  4:11         ` David Rientjes
@ 2018-04-18  4:47           ` Tetsuo Handa
  2018-04-18  5:20             ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-18  4:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

David Rientjes wrote:
> On Wed, 18 Apr 2018, Tetsuo Handa wrote:
> > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> > > reaped.  This prevents the concurrent munlock_vma_pages_range() and
> > > unmap_page_range().  The oom reaper will simply not operate on an mm that
> > > has the bit set and leave the unmapping to exit_mmap().
> > 
> > This change assumes that munlock_vma_pages_all()/unmap_vmas()/free_pgtables()
> > are never blocked for memory allocation. Is that guaranteed? For example,
> > i_mmap_lock_write() from unmap_single_vma() from unmap_vmas() is never blocked
> > for memory allocation? Commit 97b1255cb27c551d ("mm,oom_reaper: check for
> > MMF_OOM_SKIP before complaining") was waiting for i_mmap_lock_write() from
> > unlink_file_vma() from free_pgtables(). Is it really guaranteed that somebody
> > else who is holding that lock is never waiting for memory allocation?
> > 
> 
> Commit 97b1255cb27c is referencing MMF_OOM_SKIP already being set by 
> exit_mmap().  The only thing this patch changes is where that is done: 
> before or after free_pgtables().  We can certainly move it to before 
> free_pgtables() at the risk of subsequent (and eventually unnecessary) oom 
> kills.  It's not exactly the point of this patch.
> 
> I have thousands of real-world examples where additional processes were 
> oom killed while the original victim was in free_pgtables().  That's why 
> we've moved the MMF_OOM_SKIP to after free_pgtables().

"we have moved"? No, not yet. Your patch is about to move it.

My question is: is it guaranteed that munlock_vma_pages_all()/unmap_vmas()/free_pgtables()
by exit_mmap() are never blocked for memory allocation. Note that exit_mmap() tries to unmap
all pages while the OOM reaper tries to unmap only safe pages. If there is possibility that
munlock_vma_pages_all()/unmap_vmas()/free_pgtables() by exit_mmap() are blocked for memory
allocation, your patch will introduce an OOM livelock.

>                                                         I'm not sure how 
> likely your scenario is in the real world, but if it poses a problem then 
> I believe it should be fixed by eventually deferring previous victims as a 
> change to oom_evaluate_task(), not exit_mmap().  If you'd like me to fix 
> that, please send along your test case that triggers it and I will send a 
> patch.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  4:47           ` Tetsuo Handa
@ 2018-04-18  5:20             ` David Rientjes
  0 siblings, 0 replies; 47+ messages in thread
From: David Rientjes @ 2018-04-18  5:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Michal Hocko, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Wed, 18 Apr 2018, Tetsuo Handa wrote:

> > Commit 97b1255cb27c is referencing MMF_OOM_SKIP already being set by 
> > exit_mmap().  The only thing this patch changes is where that is done: 
> > before or after free_pgtables().  We can certainly move it to before 
> > free_pgtables() at the risk of subsequent (and eventually unnecessary) oom 
> > kills.  It's not exactly the point of this patch.
> > 
> > I have thousands of real-world examples where additional processes were 
> > oom killed while the original victim was in free_pgtables().  That's why 
> > we've moved the MMF_OOM_SKIP to after free_pgtables().
> 
> "we have moved"? No, not yet. Your patch is about to move it.
> 

I'm referring to our own kernel, we have thousands of real-world examples 
where additional processes have been oom killed where the original victim 
is in free_pgtables().  It actually happens about 10-15% of the time in 
automated testing where you create a 128MB memcg, fork a canary, and then 
fork a >128MB memory hog.  10-15% of the time both processes get oom 
killed: the memory hog first (higher rss), the canary second.  The pgtable 
stat is unchanged between oom kills.

> My question is: is it guaranteed that munlock_vma_pages_all()/unmap_vmas()/free_pgtables()
> by exit_mmap() are never blocked for memory allocation. Note that exit_mmap() tries to unmap
> all pages while the OOM reaper tries to unmap only safe pages. If there is possibility that
> munlock_vma_pages_all()/unmap_vmas()/free_pgtables() by exit_mmap() are blocked for memory
> allocation, your patch will introduce an OOM livelock.
> 

If munlock_vma_pages_all(), unmap_vmas(), or free_pgtables() require 
memory to make forward progress, then we have bigger problems :)

I just ran a query of real-world oom kill logs that I have.  In 33,773,705 
oom kills, I have no evidence of a thread failing to exit after reaching 
exit_mmap().

You may recall from my support of your patch to emit the stack trace when 
the oom reaper fails, in https://marc.info/?l=linux-mm&m=152157881518627, 
that I have logs of 28,222,058 occurrences of the oom reaper where it 
successfully frees memory and the victim exits.

If you'd like to pursue the possibility that exit_mmap() blocks before 
freeing memory that we have somehow been lucky to miss in 33 million 
occurrences, I'd appreciate the test case.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  2:52     ` [patch v2] " David Rientjes
  2018-04-18  3:55       ` Tetsuo Handa
@ 2018-04-18  7:50       ` Michal Hocko
  2018-04-18 11:49         ` Tetsuo Handa
  2018-04-18 19:14         ` David Rientjes
  1 sibling, 2 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-18  7:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Tue 17-04-18 19:52:41, David Rientjes wrote:
> Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> possible for the oom reaper to concurrently operate on an mm until
> MMF_OOM_SKIP is set.
> 
> This allows munlock_vma_pages_all() to concurrently run while the oom
> reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> clearing VM_LOCKED from vm_flags before actually doing the munlock to
> determine if any other vmas are locking the same memory, the check for
> VM_LOCKED in the oom reaper is racy.
> 
> This is especially noticeable on architectures such as powerpc where
> clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
> is zapped by the oom reaper during follow_page_mask() after the check for
> pmd_none() is bypassed, this ends up deferencing a NULL ptl.
> 
> Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> reaped.  This prevents the concurrent munlock_vma_pages_range() and
> unmap_page_range().  The oom reaper will simply not operate on an mm that
> has the bit set and leave the unmapping to exit_mmap().

This will further complicate the protocol and actually theoretically
restores the oom lockup issues because the oom reaper doesn't set
MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing
blocks there... So the resulting code is more fragile and tricky.

Can we try a simpler way and get back to what I was suggesting before
[1] and simply not play tricks with
		down_write(&mm->mmap_sem);
		up_write(&mm->mmap_sem);

and use the write lock in exit_mmap for oom_victims?

Andrea wanted to make this more clever but this is the second fallout
which could have been prevented. The patch would be smaller and the
locking protocol easier

[1] http://lkml.kernel.org/r/20170727065023.GB20970@dhcp22.suse.cz

> Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> Cc: stable@vger.kernel.org [4.14+]
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  v2:
>   - oom reaper only sets MMF_OOM_SKIP if MMF_UNSTABLE was never set (either
>     by itself or by exit_mmap(), per Tetsuo
>   - s/kick_all_cpus_sync/serialize_against_pte_lookup/ in changelog as more
>     isolated way of forcing cpus as non-idle on power
> 
>  mm/mmap.c     | 38 ++++++++++++++++++++------------------
>  mm/oom_kill.c | 28 +++++++++++++---------------
>  2 files changed, 33 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> +	if (unlikely(mm_is_oom_victim(mm))) {
> +		/*
> +		 * Wait for oom_reap_task() to stop working on this mm.  Because
> +		 * MMF_UNSTABLE is already set before calling down_read(),
> +		 * oom_reap_task() will not run on this mm after up_write().
> +		 * oom_reap_task() also depends on a stable VM_LOCKED flag to
> +		 * indicate it should not unmap during munlock_vma_pages_all().
> +		 *
> +		 * mm_is_oom_victim() cannot be set from under us because
> +		 * victim->mm is already set to NULL under task_lock before
> +		 * calling mmput() and victim->signal->oom_mm is set by the oom
> +		 * killer only if victim->mm is non-NULL while holding
> +		 * task_lock().
> +		 */
> +		set_bit(MMF_UNSTABLE, &mm->flags);
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
> +
>  	if (mm->locked_vm) {
>  		vma = mm->mmap;
>  		while (vma) {
> @@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm)
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
> -
> -	if (unlikely(mm_is_oom_victim(mm))) {
> -		/*
> -		 * Wait for oom_reap_task() to stop working on this
> -		 * mm. Because MMF_OOM_SKIP is already set before
> -		 * calling down_read(), oom_reap_task() will not run
> -		 * on this "mm" post up_write().
> -		 *
> -		 * mm_is_oom_victim() cannot be set from under us
> -		 * either because victim->mm is already set to NULL
> -		 * under task_lock before calling mmput and oom_mm is
> -		 * set not NULL by the OOM killer only if victim->mm
> -		 * is found not NULL while holding the task_lock.
> -		 */
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> -		down_write(&mm->mmap_sem);
> -		up_write(&mm->mmap_sem);
> -	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
>  
>  	/*
>  	 * Walk the list again, actually closing and freeing it,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	}
>  
>  	/*
> -	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
> -	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> +	 * Tell all users of get_user/copy_from_user etc... that the content
> +	 * is no longer stable. No barriers really needed because unmapping
> +	 * should imply barriers already and the reader would hit a page fault
> +	 * if it stumbled over reaped memory.
> +	 *
> +	 * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't
> +	 * work on the mm anymore. The check for MMF_OOM_UNSTABLE must run
>  	 * under mmap_sem for reading because it serializes against the
>  	 * down_write();up_write() cycle in exit_mmap().
>  	 */
> -	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> +	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) {
>  		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
>  		goto unlock_oom;
> @@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>  	trace_start_task_reaping(tsk->pid);
>  
> -	/*
> -	 * Tell all users of get_user/copy_from_user etc... that the content
> -	 * is no longer stable. No barriers really needed because unmapping
> -	 * should imply barriers already and the reader would hit a page fault
> -	 * if it stumbled over a reaped memory.
> -	 */
> -	set_bit(MMF_UNSTABLE, &mm->flags);
> -
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
>  		if (!can_madv_dontneed_vma(vma))
>  			continue;
> @@ -567,6 +564,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  			tlb_finish_mmu(&tlb, start, end);
>  		}
>  	}
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
> @@ -594,7 +592,6 @@ static void oom_reap_task(struct task_struct *tsk)
>  	    test_bit(MMF_OOM_SKIP, &mm->flags))
>  		goto done;
>  
> -
>  	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  		task_pid_nr(tsk), tsk->comm);
>  	debug_show_all_locks();
> @@ -603,10 +600,11 @@ static void oom_reap_task(struct task_struct *tsk)
>  	tsk->oom_reaper_list = NULL;
>  
>  	/*
> -	 * Hide this mm from OOM killer because it has been either reaped or
> -	 * somebody can't call up_write(mmap_sem).
> +	 * If the oom reaper could not get started on this mm and it has not yet
> +	 * reached exit_mmap(), set MMF_OOM_SKIP to disregard.
>  	 */
> -	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (!test_bit(MMF_UNSTABLE, &mm->flags))
> +		set_bit(MMF_OOM_SKIP, &mm->flags);
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  7:50       ` Michal Hocko
@ 2018-04-18 11:49         ` Tetsuo Handa
  2018-04-18 11:58           ` Michal Hocko
  2018-04-18 19:14         ` David Rientjes
  1 sibling, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-18 11:49 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm

Michal Hocko wrote:
> On Tue 17-04-18 19:52:41, David Rientjes wrote:
> > Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> > possible for the oom reaper to concurrently operate on an mm until
> > MMF_OOM_SKIP is set.
> > 
> > This allows munlock_vma_pages_all() to concurrently run while the oom
> > reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> > clearing VM_LOCKED from vm_flags before actually doing the munlock to
> > determine if any other vmas are locking the same memory, the check for
> > VM_LOCKED in the oom reaper is racy.
> > 
> > This is especially noticeable on architectures such as powerpc where
> > clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
> > is zapped by the oom reaper during follow_page_mask() after the check for
> > pmd_none() is bypassed, this ends up deferencing a NULL ptl.
> > 
> > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> > reaped.  This prevents the concurrent munlock_vma_pages_range() and
> > unmap_page_range().  The oom reaper will simply not operate on an mm that
> > has the bit set and leave the unmapping to exit_mmap().
> 
> This will further complicate the protocol and actually theoretically
> restores the oom lockup issues because the oom reaper doesn't set
> MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing
> blocks there... So the resulting code is more fragile and tricky.
> 
> Can we try a simpler way and get back to what I was suggesting before
> [1] and simply not play tricks with
> 		down_write(&mm->mmap_sem);
> 		up_write(&mm->mmap_sem);
> 
> and use the write lock in exit_mmap for oom_victims?

You mean something like this?
Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write.
It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()...

diff --git a/mm/mmap.c b/mm/mmap.c
index 188f195..ba7083b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3011,17 +3011,22 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	const bool is_oom_mm = mm_is_oom_victim(mm);
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
 	if (mm->locked_vm) {
+		if (is_oom_mm)
+			down_write(&mm->mmap_sem);
 		vma = mm->mmap;
 		while (vma) {
 			if (vma->vm_flags & VM_LOCKED)
 				munlock_vma_pages_all(vma);
 			vma = vma->vm_next;
 		}
+		if (is_oom_mm)
+			up_write(&mm->mmap_sem);
 	}
 
 	arch_exit_mmap(mm);
@@ -3037,7 +3042,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
+	if (unlikely(is_oom_mm)) {
 		/*
 		 * Wait for oom_reap_task() to stop working on this
 		 * mm. Because MMF_OOM_SKIP is already set before

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18 11:49         ` Tetsuo Handa
@ 2018-04-18 11:58           ` Michal Hocko
  2018-04-18 13:25             ` Tetsuo Handa
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-18 11:58 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

On Wed 18-04-18 20:49:11, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 17-04-18 19:52:41, David Rientjes wrote:
> > > Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> > > possible for the oom reaper to concurrently operate on an mm until
> > > MMF_OOM_SKIP is set.
> > > 
> > > This allows munlock_vma_pages_all() to concurrently run while the oom
> > > reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> > > clearing VM_LOCKED from vm_flags before actually doing the munlock to
> > > determine if any other vmas are locking the same memory, the check for
> > > VM_LOCKED in the oom reaper is racy.
> > > 
> > > This is especially noticeable on architectures such as powerpc where
> > > clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
> > > is zapped by the oom reaper during follow_page_mask() after the check for
> > > pmd_none() is bypassed, this ends up deferencing a NULL ptl.
> > > 
> > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> > > reaped.  This prevents the concurrent munlock_vma_pages_range() and
> > > unmap_page_range().  The oom reaper will simply not operate on an mm that
> > > has the bit set and leave the unmapping to exit_mmap().
> > 
> > This will further complicate the protocol and actually theoretically
> > restores the oom lockup issues because the oom reaper doesn't set
> > MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing
> > blocks there... So the resulting code is more fragile and tricky.
> > 
> > Can we try a simpler way and get back to what I was suggesting before
> > [1] and simply not play tricks with
> > 		down_write(&mm->mmap_sem);
> > 		up_write(&mm->mmap_sem);
> > 
> > and use the write lock in exit_mmap for oom_victims?
> 
> You mean something like this?

or simply hold the write lock until we unmap and free page tables.
It would make the locking rules much more straightforward.
What you are proposing is more focused on this particular fix and it
would work as well but the subtle locking would still stay in place.
I am not sure we want the trickiness.

> Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write.
> It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()...

I am not sure I understand.

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 188f195..ba7083b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3011,17 +3011,22 @@ void exit_mmap(struct mm_struct *mm)
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;
> +	const bool is_oom_mm = mm_is_oom_victim(mm);
>  
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
>  	if (mm->locked_vm) {
> +		if (is_oom_mm)
> +			down_write(&mm->mmap_sem);
>  		vma = mm->mmap;
>  		while (vma) {
>  			if (vma->vm_flags & VM_LOCKED)
>  				munlock_vma_pages_all(vma);
>  			vma = vma->vm_next;
>  		}
> +		if (is_oom_mm)
> +			up_write(&mm->mmap_sem);
>  	}
>  
>  	arch_exit_mmap(mm);
> @@ -3037,7 +3042,7 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> -	if (unlikely(mm_is_oom_victim(mm))) {
> +	if (unlikely(is_oom_mm)) {
>  		/*
>  		 * Wait for oom_reap_task() to stop working on this
>  		 * mm. Because MMF_OOM_SKIP is already set before

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18 11:58           ` Michal Hocko
@ 2018-04-18 13:25             ` Tetsuo Handa
  2018-04-18 13:44               ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-18 13:25 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

Michal Hocko wrote:
> > > Can we try a simpler way and get back to what I was suggesting before
> > > [1] and simply not play tricks with
> > > 		down_write(&mm->mmap_sem);
> > > 		up_write(&mm->mmap_sem);
> > > 
> > > and use the write lock in exit_mmap for oom_victims?
> > 
> > You mean something like this?
> 
> or simply hold the write lock until we unmap and free page tables.

That increases possibility of __oom_reap_task_mm() giving up reclaim and
setting MMF_OOM_SKIP when exit_mmap() is making forward progress, doesn't it?
I think that it is better that __oom_reap_task_mm() does not give up when
exit_mmap() can make progress. In that aspect, the section protected by
mmap_sem held for write should be as short as possible.

> It would make the locking rules much more straightforward.
> What you are proposing is more focused on this particular fix and it
> would work as well but the subtle locking would still stay in place.

Yes, this change is focused on -stable patch.

> I am not sure we want the trickiness.

I don't like the trickiness too. I think we can even consider direct OOM
reaping suggested at https://patchwork.kernel.org/patch/10095661/ .

> 
> > Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write.
> > It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()...
> 
> I am not sure I understand.

To reduce possibility of __oom_reap_task_mm() giving up reclaim and
setting MMF_OOM_SKIP.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18 13:25             ` Tetsuo Handa
@ 2018-04-18 13:44               ` Michal Hocko
  2018-04-18 14:28                 ` Tetsuo Handa
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-18 13:44 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

On Wed 18-04-18 22:25:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > Can we try a simpler way and get back to what I was suggesting before
> > > > [1] and simply not play tricks with
> > > > 		down_write(&mm->mmap_sem);
> > > > 		up_write(&mm->mmap_sem);
> > > > 
> > > > and use the write lock in exit_mmap for oom_victims?
> > > 
> > > You mean something like this?
> > 
> > or simply hold the write lock until we unmap and free page tables.
> 
> That increases possibility of __oom_reap_task_mm() giving up reclaim and
> setting MMF_OOM_SKIP when exit_mmap() is making forward progress, doesn't it?

Yes it does. But it is not that likely and easily noticeable from the
logs so we can make the locking protocol more complex if this really
hits two often.

> I think that it is better that __oom_reap_task_mm() does not give up when
> exit_mmap() can make progress. In that aspect, the section protected by
> mmap_sem held for write should be as short as possible.

Sure, but then weight the complexity on the other side and try to think
whether simpler code which works most of the time is better than a buggy
complex one. The current protocol has 2 followup fixes which speaks for
itself.
 
[...]
> > > Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write.
> > > It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()...
> > 
> > I am not sure I understand.
> 
> To reduce possibility of __oom_reap_task_mm() giving up reclaim and
> setting MMF_OOM_SKIP.

Still do not understand. Do you want to call __oom_reap_task_mm from
__mmput? If yes why would you do so when exit_mmap does a stronger
version of it?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18 13:44               ` Michal Hocko
@ 2018-04-18 14:28                 ` Tetsuo Handa
  0 siblings, 0 replies; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-18 14:28 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

Michal Hocko wrote:
> > > > Then, I'm tempted to call __oom_reap_task_mm() before holding mmap_sem for write.
> > > > It would be OK to call __oom_reap_task_mm() at the beginning of __mmput()...
> > > 
> > > I am not sure I understand.
> > 
> > To reduce possibility of __oom_reap_task_mm() giving up reclaim and
> > setting MMF_OOM_SKIP.
> 
> Still do not understand. Do you want to call __oom_reap_task_mm from
> __mmput?

Yes.

>          If yes why would you do so when exit_mmap does a stronger
> version of it?

Because memory which can be reclaimed by the OOM reaper is guaranteed
to be reclaimed before setting MMF_OOM_SKIP when the OOM reaper and
exit_mmap() contended, because the OOM reaper (weak reclaim) sets
MMF_OOM_SKIP after one second for safety in case of exit_mmap()
(strong reclaim) failing to make forward progress.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18  7:50       ` Michal Hocko
  2018-04-18 11:49         ` Tetsuo Handa
@ 2018-04-18 19:14         ` David Rientjes
  2018-04-19  6:35           ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-18 19:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Wed, 18 Apr 2018, Michal Hocko wrote:

> > Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> > possible for the oom reaper to concurrently operate on an mm until
> > MMF_OOM_SKIP is set.
> > 
> > This allows munlock_vma_pages_all() to concurrently run while the oom
> > reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> > clearing VM_LOCKED from vm_flags before actually doing the munlock to
> > determine if any other vmas are locking the same memory, the check for
> > VM_LOCKED in the oom reaper is racy.
> > 
> > This is especially noticeable on architectures such as powerpc where
> > clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
> > is zapped by the oom reaper during follow_page_mask() after the check for
> > pmd_none() is bypassed, this ends up deferencing a NULL ptl.
> > 
> > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> > reaped.  This prevents the concurrent munlock_vma_pages_range() and
> > unmap_page_range().  The oom reaper will simply not operate on an mm that
> > has the bit set and leave the unmapping to exit_mmap().
> 
> This will further complicate the protocol and actually theoretically
> restores the oom lockup issues because the oom reaper doesn't set
> MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing
> blocks there... So the resulting code is more fragile and tricky.
> 

exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
entered.

> Can we try a simpler way and get back to what I was suggesting before
> [1] and simply not play tricks with
> 		down_write(&mm->mmap_sem);
> 		up_write(&mm->mmap_sem);
> 
> and use the write lock in exit_mmap for oom_victims?
> 
> Andrea wanted to make this more clever but this is the second fallout
> which could have been prevented. The patch would be smaller and the
> locking protocol easier
> 
> [1] http://lkml.kernel.org/r/20170727065023.GB20970@dhcp22.suse.cz
> 

exit_mmap() doesn't need to protect munlock, unmap, or freeing pgtables 
with mm->mmap_sem; the issue is that you need to start holding it in this 
case before munlock and then until at least the end of free_pgtables().  
Anything in between also needlessly holds it so could introduce weird 
lockdep issues that only trigger for oom victims, i.e. they could be very 
rare on some configs.  I don't necessarily like holding a mutex over 
functions where it's actually not needed, not only as a general principle 
but also because the oom reaper can now infer that reaping isn't possible 
just because it can't do down_read() and isn't aware the thread is 
actually in exit_mmap() needlessly holding it.

I like how the oom reaper currently retries on failing to grab 
mm->mmap_sem and then backs out because it's assumed it can't make forward 
progress.  Adding additional complication for situations where 
mm->mmap_sem is contended (and munlock to free_pgtables() can take a long 
time for certain processes) to check if it's actually already in 
exit_mmap() would seem more complicated than this.

The patch is simply using MMF_UNSTABLE rather than MMF_OOM_SKIP to 
serialize exit_mmap() with the oom reaper and doing it before anything 
interesting in exit_mmap() because without it the munlock can trivially 
race with unmap_page_range() and cause a NULL pointer or #GP on a pmd or 
pte.  The way Andrea implemented it is fine, we simply have revealed a 
race between munlock_vma_pages_all() and unmap_page_range() that needs it 
to do set_bit(); down_write(); up_write(); earlier.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-18 19:14         ` David Rientjes
@ 2018-04-19  6:35           ` Michal Hocko
  2018-04-19 10:45             ` Tetsuo Handa
  2018-04-19 19:34             ` David Rientjes
  0 siblings, 2 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-19  6:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Wed 18-04-18 12:14:29, David Rientjes wrote:
> On Wed, 18 Apr 2018, Michal Hocko wrote:
> 
> > > Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> > > possible for the oom reaper to concurrently operate on an mm until
> > > MMF_OOM_SKIP is set.
> > > 
> > > This allows munlock_vma_pages_all() to concurrently run while the oom
> > > reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> > > clearing VM_LOCKED from vm_flags before actually doing the munlock to
> > > determine if any other vmas are locking the same memory, the check for
> > > VM_LOCKED in the oom reaper is racy.
> > > 
> > > This is especially noticeable on architectures such as powerpc where
> > > clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
> > > is zapped by the oom reaper during follow_page_mask() after the check for
> > > pmd_none() is bypassed, this ends up deferencing a NULL ptl.
> > > 
> > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
> > > reaped.  This prevents the concurrent munlock_vma_pages_range() and
> > > unmap_page_range().  The oom reaper will simply not operate on an mm that
> > > has the bit set and leave the unmapping to exit_mmap().
> > 
> > This will further complicate the protocol and actually theoretically
> > restores the oom lockup issues because the oom reaper doesn't set
> > MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing
> > blocks there... So the resulting code is more fragile and tricky.
> > 
> 
> exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
> entered.

Not true. munlock_vma_pages_all might take page_lock which can have
unpredictable dependences. This is the reason why we are ruling out
mlocked VMAs in the first place when reaping the address space.

> > Can we try a simpler way and get back to what I was suggesting before
> > [1] and simply not play tricks with
> > 		down_write(&mm->mmap_sem);
> > 		up_write(&mm->mmap_sem);
> > 
> > and use the write lock in exit_mmap for oom_victims?
> > 
> > Andrea wanted to make this more clever but this is the second fallout
> > which could have been prevented. The patch would be smaller and the
> > locking protocol easier
> > 
> > [1] http://lkml.kernel.org/r/20170727065023.GB20970@dhcp22.suse.cz
> > 
> 
> exit_mmap() doesn't need to protect munlock, unmap, or freeing pgtables 
> with mm->mmap_sem; the issue is that you need to start holding it in this 
> case before munlock and then until at least the end of free_pgtables().  
> Anything in between also needlessly holds it so could introduce weird 
> lockdep issues that only trigger for oom victims, i.e. they could be very 
> rare on some configs.  I don't necessarily like holding a mutex over 
> functions where it's actually not needed, not only as a general principle 
> but also because the oom reaper can now infer that reaping isn't possible 
> just because it can't do down_read() and isn't aware the thread is 
> actually in exit_mmap() needlessly holding it.

While you are correct, strictly speaking, because unmap_vmas can race
with the oom reaper. With the lock held during the whole operation we
can indeed trigger back off in the oom_repaer. It will keep retrying but
the tear down can take quite some time. This is a fair argument. On the
other hand your lock protocol introduces the MMF_OOM_SKIP problem I've
mentioned above and that really worries me. The primary objective of the
reaper is to guarantee a forward progress without relying on any
externalities. We might kill another OOM victim but that is safer than
lock up.

[...]

> The patch is simply using MMF_UNSTABLE rather than MMF_OOM_SKIP to 
> serialize exit_mmap() with the oom reaper and doing it before anything 
> interesting in exit_mmap() because without it the munlock can trivially 
> race with unmap_page_range() and cause a NULL pointer or #GP on a pmd or 
> pte.  The way Andrea implemented it is fine, we simply have revealed a 
> race between munlock_vma_pages_all() and unmap_page_range() that needs it 
> to do set_bit(); down_write(); up_write(); earlier.

The current protocol has proven to be error prone so I really believe we
should back off and turn it into something much simpler and build on top
of that if needed.

So do you see any _technical_ reasons why not do [1] and have a simpler
protocol easily backportable to stable trees?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19  6:35           ` Michal Hocko
@ 2018-04-19 10:45             ` Tetsuo Handa
  2018-04-19 11:04               ` Michal Hocko
  2018-04-19 19:14               ` David Rientjes
  2018-04-19 19:34             ` David Rientjes
  1 sibling, 2 replies; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-19 10:45 UTC (permalink / raw)
  To: mhocko, rientjes; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm

Michal Hocko wrote:
> > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
> > entered.
> 
> Not true. munlock_vma_pages_all might take page_lock which can have
> unpredictable dependences. This is the reason why we are ruling out
> mlocked VMAs in the first place when reaping the address space.

Wow! Then,

> While you are correct, strictly speaking, because unmap_vmas can race
> with the oom reaper. With the lock held during the whole operation we
> can indeed trigger back off in the oom_repaer. It will keep retrying but
> the tear down can take quite some time. This is a fair argument. On the
> other hand your lock protocol introduces the MMF_OOM_SKIP problem I've
> mentioned above and that really worries me. The primary objective of the
> reaper is to guarantee a forward progress without relying on any
> externalities. We might kill another OOM victim but that is safer than
> lock up.

current code has a possibility that the OOM reaper is disturbed by
unpredictable dependencies, like I worried that

  I think that there is a possibility that the OOM reaper tries to reclaim
  mlocked pages as soon as exit_mmap() cleared VM_LOCKED flag by calling
  munlock_vma_pages_all().

when current approach was proposed. We currently have the MMF_OOM_SKIP problem.
We need to teach the OOM reaper stop reaping as soon as entering exit_mmap().
Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *)
decreased for last 1 second) ?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19 10:45             ` Tetsuo Handa
@ 2018-04-19 11:04               ` Michal Hocko
  2018-04-19 11:51                 ` Tetsuo Handa
  2018-04-19 19:14               ` David Rientjes
  1 sibling, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-19 11:04 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

On Thu 19-04-18 19:45:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
> > > entered.
> > 
> > Not true. munlock_vma_pages_all might take page_lock which can have
> > unpredictable dependences. This is the reason why we are ruling out
> > mlocked VMAs in the first place when reaping the address space.
> 
> Wow! Then,
> 
> > While you are correct, strictly speaking, because unmap_vmas can race
> > with the oom reaper. With the lock held during the whole operation we
> > can indeed trigger back off in the oom_repaer. It will keep retrying but
> > the tear down can take quite some time. This is a fair argument. On the
> > other hand your lock protocol introduces the MMF_OOM_SKIP problem I've
> > mentioned above and that really worries me. The primary objective of the
> > reaper is to guarantee a forward progress without relying on any
> > externalities. We might kill another OOM victim but that is safer than
> > lock up.
> 
> current code has a possibility that the OOM reaper is disturbed by
> unpredictable dependencies, like I worried that
> 
>   I think that there is a possibility that the OOM reaper tries to reclaim
>   mlocked pages as soon as exit_mmap() cleared VM_LOCKED flag by calling
>   munlock_vma_pages_all().
> 
> when current approach was proposed. We currently have the MMF_OOM_SKIP problem.
> We need to teach the OOM reaper stop reaping as soon as entering exit_mmap().
> Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *)
> decreased for last 1 second) ?

Can we start simple and build a more elaborate heuristics on top _please_?
In other words holding the mmap_sem for write for oom victims in
exit_mmap should handle the problem. We can then enhance this to probe
for progress or any other clever tricks if we find out that the race
happens too often and we kill more than necessary.

Let's not repeat the error of trying to be too clever from the beginning
as we did previously. This are is just too subtle and obviously error
prone.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19 11:04               ` Michal Hocko
@ 2018-04-19 11:51                 ` Tetsuo Handa
  2018-04-19 12:48                   ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-19 11:51 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

Michal Hocko wrote:
> > We need to teach the OOM reaper stop reaping as soon as entering exit_mmap().
> > Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *)
> > decreased for last 1 second) ?
> 
> Can we start simple and build a more elaborate heuristics on top _please_?
> In other words holding the mmap_sem for write for oom victims in
> exit_mmap should handle the problem. We can then enhance this to probe
> for progress or any other clever tricks if we find out that the race
> happens too often and we kill more than necessary.
> 
> Let's not repeat the error of trying to be too clever from the beginning
> as we did previously. This are is just too subtle and obviously error
> prone.
> 
Something like this?

---
 mm/mmap.c     | 41 +++++++++++++++++++++++------------------
 mm/oom_kill.c | 29 +++++++++++------------------
 2 files changed, 34 insertions(+), 36 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 188f195..3edb7da 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,28 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Tell oom_reap_task() not to start reaping this mm.
+		 *
+		 * oom_reap_task() depends on a stable VM_LOCKED flag to
+		 * indicate it should not unmap during munlock_vma_pages_all().
+		 *
+		 * Since MMF_UNSTABLE is set before calling down_write(),
+		 * oom_reap_task() which calls down_read() before testing
+		 * MMF_UNSTABLE will not run on this mm after up_write().
+		 *
+		 * mm_is_oom_victim() cannot be set from under us because
+		 * victim->mm is already set to NULL under task_lock before
+		 * calling mmput() and victim->signal->oom_mm is set by the oom
+		 * killer only if victim->mm is non-NULL while holding
+		 * task_lock().
+		 */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,26 +3058,9 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ff992fa..1fef1b6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -510,25 +510,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	/*
 	 * If the mm has invalidate_{start,end}() notifiers that could block,
+	 * or if the mm is in exit_mmap() which has unpredictable dependencies,
 	 * sleep to give the oom victim some more time.
 	 * TODO: we really want to get rid of this ugly hack and make sure that
 	 * notifiers cannot block for unbounded amount of time
 	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
-	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
-	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (mm_has_blockable_invalidate_notifiers(mm) ||
+	    test_bit(MMF_UNSTABLE, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
+		schedule_timeout_idle(HZ);
 		goto unlock_oom;
 	}
 
@@ -590,11 +581,9 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
+	if (test_bit(MMF_UNSTABLE, &mm->flags))
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();
@@ -603,8 +592,12 @@ static void oom_reap_task(struct task_struct *tsk)
 	tsk->oom_reaper_list = NULL;
 
 	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
+	 * Hide this mm from the OOM killer because:
+	 *   the OOM reaper completed reaping
+	 * or
+	 *   exit_mmap() told the OOM reaper not to start reaping
+	 * or
+	 *   neither exit_mmap() nor the OOM reaper started reaping
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19 11:51                 ` Tetsuo Handa
@ 2018-04-19 12:48                   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-19 12:48 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

On Thu 19-04-18 20:51:45, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > We need to teach the OOM reaper stop reaping as soon as entering exit_mmap().
> > > Maybe let the OOM reaper poll for progress (e.g. none of get_mm_counter(mm, *)
> > > decreased for last 1 second) ?
> > 
> > Can we start simple and build a more elaborate heuristics on top _please_?
> > In other words holding the mmap_sem for write for oom victims in
> > exit_mmap should handle the problem. We can then enhance this to probe
> > for progress or any other clever tricks if we find out that the race
> > happens too often and we kill more than necessary.
> > 
> > Let's not repeat the error of trying to be too clever from the beginning
> > as we did previously. This are is just too subtle and obviously error
> > prone.
> > 
> Something like this?

Not really. This is still building on the tricky locking protocol we
have and proven to be error prone.  Can we simply take the mmap_sem for
write for oom victims before munlocking and release it after munmapping?
 
I am OK with building on the current protocol if taking the mmap_sem for
the whole section has some serious down sides but I haven't heard any
yet, to be honest.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19 10:45             ` Tetsuo Handa
  2018-04-19 11:04               ` Michal Hocko
@ 2018-04-19 19:14               ` David Rientjes
  1 sibling, 0 replies; 47+ messages in thread
From: David Rientjes @ 2018-04-19 19:14 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, aarcange, guro, linux-kernel, linux-mm

On Thu, 19 Apr 2018, Tetsuo Handa wrote:

> current code has a possibility that the OOM reaper is disturbed by
> unpredictable dependencies, like I worried that
> 
>   I think that there is a possibility that the OOM reaper tries to reclaim
>   mlocked pages as soon as exit_mmap() cleared VM_LOCKED flag by calling
>   munlock_vma_pages_all().
> 
> when current approach was proposed.

That's exactly the issue that this patch is fixing, yes.  If you brought 
that possibility up then I'm sorry that it was ignored.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19  6:35           ` Michal Hocko
  2018-04-19 10:45             ` Tetsuo Handa
@ 2018-04-19 19:34             ` David Rientjes
  2018-04-19 22:13               ` Tetsuo Handa
  2018-04-20  8:23               ` Michal Hocko
  1 sibling, 2 replies; 47+ messages in thread
From: David Rientjes @ 2018-04-19 19:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Thu, 19 Apr 2018, Michal Hocko wrote:

> > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
> > entered.
> 
> Not true. munlock_vma_pages_all might take page_lock which can have
> unpredictable dependences. This is the reason why we are ruling out
> mlocked VMAs in the first place when reaping the address space.
> 

I don't find any occurrences in millions of oom kills in real-world 
scenarios where this matters.  The solution is certainly not to hold 
down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead.  If 
exit_mmap() is not making forward progress then that's a separate issue; 
that would need to be fixed in one of two ways: (1) in oom_reap_task() to 
try over a longer duration before setting MMF_OOM_SKIP itself, but that 
would have to be a long duration to allow a large unmap and page table 
free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but 
only if MMF_UNSTABLE has been set for a long period of time so we target 
another process when the oom killer has given up.

Either of those two fixes are simple to implement, I'd just like to see a 
bug report with stack traces to indicate that a victim getting stalled in 
exit_mmap() is a problem to justify the patch.

I'm trying to fix the page table corruption that is trivial to trigger on 
powerpc.  We simply cannot allow the oom reaper's unmap_page_range() to 
race with munlock_vma_pages_range(), ever.  Holding down_write on 
mm->mmap_sem otherwise needlessly over a large amount of code is riskier 
(hasn't been done or tested here), more error prone (any code change over 
this large area of code or in functions it calls are unnecessarily 
burdened by unnecessary locking), makes exit_mmap() less extensible for 
the same reason, and causes the oom reaper to give up and go set 
MMF_OOM_SKIP itself because it depends on taking down_read while the 
thread is still exiting.

> On the
> other hand your lock protocol introduces the MMF_OOM_SKIP problem I've
> mentioned above and that really worries me. The primary objective of the
> reaper is to guarantee a forward progress without relying on any
> externalities. We might kill another OOM victim but that is safer than
> lock up.
> 

I understand the concern, but it's the difference between the victim 
getting stuck in exit_mmap() and actually taking a long time to free its 
memory in exit_mmap().  I don't have evidence of the former.  If there are 
bug reports for occurrences of the oom reaper being unable to reap, it 
would be helpful to see.  The only reports about the "unable to reap" 
message was that the message itself was racy, not that a thread got stuck.  
This is more reason to not take down_write unnecessarily in the 
exit_mmap() path, because it influences an oom reaper heurstic.

> The current protocol has proven to be error prone so I really believe we
> should back off and turn it into something much simpler and build on top
> of that if needed.
> 
> So do you see any _technical_ reasons why not do [1] and have a simpler
> protocol easily backportable to stable trees?

It's not simpler per the above, and I agree with Andrea's assessment when 
this was originally implemented.  The current method is not error prone, 
it works, it just wasn't protecting enough of exit_mmap().  That's not a 
critcism of the method itself, it's a bugfix that expands its critical 
section.  

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19 19:34             ` David Rientjes
@ 2018-04-19 22:13               ` Tetsuo Handa
  2018-04-20  8:23               ` Michal Hocko
  1 sibling, 0 replies; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-19 22:13 UTC (permalink / raw)
  To: rientjes, mhocko; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm

David Rientjes wrote:
> On Thu, 19 Apr 2018, Michal Hocko wrote:
> 
> > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
> > > entered.
> > 
> > Not true. munlock_vma_pages_all might take page_lock which can have
> > unpredictable dependences. This is the reason why we are ruling out
> > mlocked VMAs in the first place when reaping the address space.
> > 
> 
> I don't find any occurrences in millions of oom kills in real-world 
> scenarios where this matters.

Is your OOM events system-wide rather than memcg?
It is trivial to hide bugs in the details if your OOM events is memcg OOM.

>                                The solution is certainly not to hold 
> down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead.  If 
> exit_mmap() is not making forward progress then that's a separate issue; 

Just a simple memory + CPU pressure is sufficient for making exit_mmap()
unable to make forward progress. Try triggering system-wide OOM event by
running below reproducer. We are ever ignoring this issue.

-----
#include <unistd.h>

int main(int argc, char *argv[])
{
        while (1)
                if (fork() == 0)
                        execlp(argv[0], argv[0], NULL);
        return 0;
}
-----

> that would need to be fixed in one of two ways: (1) in oom_reap_task() to 
> try over a longer duration before setting MMF_OOM_SKIP itself, but that 
> would have to be a long duration to allow a large unmap and page table 
> free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but 
> only if MMF_UNSTABLE has been set for a long period of time so we target 
> another process when the oom killer has given up.
> 
> Either of those two fixes are simple to implement, I'd just like to see a 
> bug report with stack traces to indicate that a victim getting stalled in 
> exit_mmap() is a problem to justify the patch.

It is too hard for normal users to report problems under memory pressure
without a mean to help understand what is happening. See a bug report at
https://lists.opensuse.org/opensuse-kernel/2018-04/msg00018.html for example.

> 
> I'm trying to fix the page table corruption that is trivial to trigger on 
> powerpc.  We simply cannot allow the oom reaper's unmap_page_range() to 
> race with munlock_vma_pages_range(), ever.  Holding down_write on 
> mm->mmap_sem otherwise needlessly over a large amount of code is riskier 
> (hasn't been done or tested here), more error prone (any code change over 
> this large area of code or in functions it calls are unnecessarily 
> burdened by unnecessary locking), makes exit_mmap() less extensible for 
> the same reason, and causes the oom reaper to give up and go set 
> MMF_OOM_SKIP itself because it depends on taking down_read while the 
> thread is still exiting.

I suggest reverting 212925802454 ("mm: oom: let oom_reap_task and exit_mmap
run concurrently"). We can check for progress for a while before setting
MMF_OOM_SKIP after the OOM reaper completed or gave up reaping.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-19 19:34             ` David Rientjes
  2018-04-19 22:13               ` Tetsuo Handa
@ 2018-04-20  8:23               ` Michal Hocko
  2018-04-20 12:40                   ` Michal Hocko
  2018-04-22  3:45                 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
  1 sibling, 2 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-20  8:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Thu 19-04-18 12:34:53, David Rientjes wrote:
> On Thu, 19 Apr 2018, Michal Hocko wrote:
> 
> > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
> > > entered.
> > 
> > Not true. munlock_vma_pages_all might take page_lock which can have
> > unpredictable dependences. This is the reason why we are ruling out
> > mlocked VMAs in the first place when reaping the address space.
> > 
> 
> I don't find any occurrences in millions of oom kills in real-world 
> scenarios where this matters.

Which doesn't really mean much. We want a guarantee here.

> The solution is certainly not to hold 
> down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead.

Why not? This is what we do for normal paths. exit path just tries to be
clever because it knows that it doesn't have to lock because there is no
concurent user. At least it wasn't until the oom reaper came. So I
really fail to see why we shouldn't do the most obvious thing and use
the locking.

> If 
> exit_mmap() is not making forward progress then that's a separate issue; 

Please read what I wrote. There is a page lock and there is no way to
guarantee it will make a forward progress. Or do you consider that not
true?

> that would need to be fixed in one of two ways: (1) in oom_reap_task() to 
> try over a longer duration before setting MMF_OOM_SKIP itself, but that 
> would have to be a long duration to allow a large unmap and page table 
> free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but 
> only if MMF_UNSTABLE has been set for a long period of time so we target 
> another process when the oom killer has given up.
> 
> Either of those two fixes are simple to implement, I'd just like to see a 
> bug report with stack traces to indicate that a victim getting stalled in 
> exit_mmap() is a problem to justify the patch.

And both are not really needed if we do the proper and straightforward
locking.

> I'm trying to fix the page table corruption that is trivial to trigger on 
> powerpc.  We simply cannot allow the oom reaper's unmap_page_range() to 
> race with munlock_vma_pages_range(), ever.

There is no discussion about that. Sure, you are right. We are just
arguing how to achieve that.

> Holding down_write on 
> mm->mmap_sem otherwise needlessly over a large amount of code is riskier 
> (hasn't been done or tested here), more error prone (any code change over 
> this large area of code or in functions it calls are unnecessarily 
> burdened by unnecessary locking), makes exit_mmap() less extensible for 
> the same reason,

I do not see any of the calls in that path could suffer from holding
mmap_sem. Do you?

> and causes the oom reaper to give up and go set 
> MMF_OOM_SKIP itself because it depends on taking down_read while the 
> thread is still exiting.

Which is the standard backoff mechanism.

> > On the
> > other hand your lock protocol introduces the MMF_OOM_SKIP problem I've
> > mentioned above and that really worries me. The primary objective of the
> > reaper is to guarantee a forward progress without relying on any
> > externalities. We might kill another OOM victim but that is safer than
> > lock up.
> > 
> 
> I understand the concern, but it's the difference between the victim 
> getting stuck in exit_mmap() and actually taking a long time to free its 
> memory in exit_mmap().  I don't have evidence of the former.

I do not really want to repeat myself. The primary purpose of the oom
reaper is to provide a _guarantee_ of the forward progress. I do not
care whether there is any evidences. All I know that lock_page has
plethora of different dependencies and we cannot clearly state this is
safe so we _must not_ depend on it when setting MMF_OOM_SKIP.

The way how the oom path was fragile and lockup prone based on
optimistic assumptions shouldn't be repeated.

That being said, I haven't heard any actual technical argument about why
locking the munmap path is a wrong thing to do while the MMF_OOM_SKIP
dependency on the page_lock really concerns me so

Nacked-by: Michal Hocko <mhocko@suse.com>

If you want to keep the current locking protocol then you really have to
make sure that the oom reaper will set MMF_OOM_SKIP when racing with
exit_mmap.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-20  8:23               ` Michal Hocko
@ 2018-04-20 12:40                   ` Michal Hocko
  2018-04-22  3:45                 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
  1 sibling, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-20 12:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Fri 20-04-18 10:23:49, Michal Hocko wrote:
> On Thu 19-04-18 12:34:53, David Rientjes wrote:
[...]
> > I understand the concern, but it's the difference between the victim 
> > getting stuck in exit_mmap() and actually taking a long time to free its 
> > memory in exit_mmap().  I don't have evidence of the former.
> 
> I do not really want to repeat myself. The primary purpose of the oom
> reaper is to provide a _guarantee_ of the forward progress. I do not
> care whether there is any evidences. All I know that lock_page has
> plethora of different dependencies and we cannot clearly state this is
> safe so we _must not_ depend on it when setting MMF_OOM_SKIP.
> 
> The way how the oom path was fragile and lockup prone based on
> optimistic assumptions shouldn't be repeated.
> 
> That being said, I haven't heard any actual technical argument about why
> locking the munmap path is a wrong thing to do while the MMF_OOM_SKIP
> dependency on the page_lock really concerns me so
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>
> 
> If you want to keep the current locking protocol then you really have to
> make sure that the oom reaper will set MMF_OOM_SKIP when racing with
> exit_mmap.

So here is my suggestion for the fix.

>From 37ab85d619f508ceaf829e57648a3d986c6d8bfc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 20 Apr 2018 13:53:08 +0200
Subject: [PATCH] oom: mm, oom: fix concurrent munlock and oom reaper unmap

David has noticed that our current assumption that the oom reaper can
race with exit_mmap is not correct. munlock_vma_pages_all() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up deferencing a NULL ptl.

Fix this by taking the exclusive mmap_sem in exit_mmap while tearing
down the address space. Once that is done MMF_OOM_SKIP is set so that
__oom_reap_task_mm can back off if it manages to take the read lock
finally.

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmap.c     | 36 ++++++++++++++++++------------------
 mm/oom_kill.c |  2 +-
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index faf85699f1a1..216efa6d9f61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3004,10 +3004,21 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool locked = false;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	/*
+	 * The mm is not accessible for anybody except for the oom reaper
+	 * which cannot race with munlocking so make sure we exclude the
+	 * two.
+	 */
+	if (unlikely(mm_is_oom_victim(mm))) {
+		down_write(&mm->mmap_sem);
+		locked = true;
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3021,7 +3032,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	vma = mm->mmap;
 	if (!vma)	/* Can happen if dup_mmap() received an OOM */
-		return;
+		goto out_unlock;
 
 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -3030,23 +3041,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3060,6 +3054,12 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+
+out_unlock:
+	if (unlikely(locked)) {
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		up_write(&mm->mmap_sem);
+	}
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfd370526909..94d26ef2c3c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -524,7 +524,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
 	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
-- 
2.16.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
@ 2018-04-20 12:40                   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-20 12:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Fri 20-04-18 10:23:49, Michal Hocko wrote:
> On Thu 19-04-18 12:34:53, David Rientjes wrote:
[...]
> > I understand the concern, but it's the difference between the victim 
> > getting stuck in exit_mmap() and actually taking a long time to free its 
> > memory in exit_mmap().  I don't have evidence of the former.
> 
> I do not really want to repeat myself. The primary purpose of the oom
> reaper is to provide a _guarantee_ of the forward progress. I do not
> care whether there is any evidences. All I know that lock_page has
> plethora of different dependencies and we cannot clearly state this is
> safe so we _must not_ depend on it when setting MMF_OOM_SKIP.
> 
> The way how the oom path was fragile and lockup prone based on
> optimistic assumptions shouldn't be repeated.
> 
> That being said, I haven't heard any actual technical argument about why
> locking the munmap path is a wrong thing to do while the MMF_OOM_SKIP
> dependency on the page_lock really concerns me so
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>
> 
> If you want to keep the current locking protocol then you really have to
> make sure that the oom reaper will set MMF_OOM_SKIP when racing with
> exit_mmap.

So here is my suggestion for the fix.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-20 12:40                   ` Michal Hocko
  (?)
@ 2018-04-22  3:22                   ` David Rientjes
  2018-04-22  3:48                     ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa
  -1 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-22  3:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Fri, 20 Apr 2018, Michal Hocko wrote:

> diff --git a/mm/mmap.c b/mm/mmap.c
> index faf85699f1a1..216efa6d9f61 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3004,10 +3004,21 @@ void exit_mmap(struct mm_struct *mm)
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;
> +	bool locked = false;
>  
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> +	/*
> +	 * The mm is not accessible for anybody except for the oom reaper
> +	 * which cannot race with munlocking so make sure we exclude the
> +	 * two.
> +	 */
> +	if (unlikely(mm_is_oom_victim(mm))) {
> +		down_write(&mm->mmap_sem);
> +		locked = true;
> +	}
> +
>  	if (mm->locked_vm) {
>  		vma = mm->mmap;
>  		while (vma) {
> @@ -3021,7 +3032,7 @@ void exit_mmap(struct mm_struct *mm)
>  
>  	vma = mm->mmap;
>  	if (!vma)	/* Can happen if dup_mmap() received an OOM */
> -		return;
> +		goto out_unlock;
>  
>  	lru_add_drain();
>  	flush_cache_mm(mm);
> @@ -3030,23 +3041,6 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> -	if (unlikely(mm_is_oom_victim(mm))) {
> -		/*
> -		 * Wait for oom_reap_task() to stop working on this
> -		 * mm. Because MMF_OOM_SKIP is already set before
> -		 * calling down_read(), oom_reap_task() will not run
> -		 * on this "mm" post up_write().
> -		 *
> -		 * mm_is_oom_victim() cannot be set from under us
> -		 * either because victim->mm is already set to NULL
> -		 * under task_lock before calling mmput and oom_mm is
> -		 * set not NULL by the OOM killer only if victim->mm
> -		 * is found not NULL while holding the task_lock.
> -		 */
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> -		down_write(&mm->mmap_sem);
> -		up_write(&mm->mmap_sem);
> -	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> @@ -3060,6 +3054,12 @@ void exit_mmap(struct mm_struct *mm)
>  		vma = remove_vma(vma);
>  	}
>  	vm_unacct_memory(nr_accounted);
> +
> +out_unlock:
> +	if (unlikely(locked)) {
> +		set_bit(MMF_OOM_SKIP, &mm->flags);
> +		up_write(&mm->mmap_sem);
> +	}
>  }
>  
>  /* Insert vm structure into process list sorted by address

How have you tested this?

I'm wondering why you do not see oom killing of many processes if the 
victim is a very large process that takes a long time to free memory in 
exit_mmap() as I do because the oom reaper gives up trying to acquire 
mm->mmap_sem and just sets MMF_OOM_SKIP itself.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-20  8:23               ` Michal Hocko
  2018-04-20 12:40                   ` Michal Hocko
@ 2018-04-22  3:45                 ` David Rientjes
  2018-04-22 13:18                   ` Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-22  3:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Fri, 20 Apr 2018, Michal Hocko wrote:

> > The solution is certainly not to hold 
> > down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead.
> 
> Why not? This is what we do for normal paths. exit path just tries to be
> clever because it knows that it doesn't have to lock because there is no
> concurent user. At least it wasn't until the oom reaper came. So I
> really fail to see why we shouldn't do the most obvious thing and use
> the locking.
> 

Because the oom reaper uses the ability to acquire mm->mmap_sem as a 
heuristic to determine when to give up and allow another process to be 
chosen.

If the heuristics of the oom reaper are to be improved, that's great, but 
this patch fixes the oops on powerpc as 4.17 material and as a stable 
backport.  It's also well tested.

> > If 
> > exit_mmap() is not making forward progress then that's a separate issue; 
> 
> Please read what I wrote. There is a page lock and there is no way to
> guarantee it will make a forward progress. Or do you consider that not
> true?
> 

I don't have any evidence of it, and since this is called before 
exit_mmap() sets MMF_OOM_SKIP then the oom reaper would need to set the 
bit itself and I would be able to find the artifact it leaves behind in 
the kernel log.  I cannot find a single instance of a thread stuck in 
munlock by way of exit_mmap() that causes the oom reaper to have to set 
the bit itself, and I should be able to if this were a problem.

> > Holding down_write on 
> > mm->mmap_sem otherwise needlessly over a large amount of code is riskier 
> > (hasn't been done or tested here), more error prone (any code change over 
> > this large area of code or in functions it calls are unnecessarily 
> > burdened by unnecessary locking), makes exit_mmap() less extensible for 
> > the same reason,
> 
> I do not see any of the calls in that path could suffer from holding
> mmap_sem. Do you?
> 
> > and causes the oom reaper to give up and go set 
> > MMF_OOM_SKIP itself because it depends on taking down_read while the 
> > thread is still exiting.
> 
> Which is the standard backoff mechanism.
> 

The reason today's locking methodology is preferred is because of the 
backoff mechanism.  Your patch kills many processes unnecessarily if the 
oom killer selects a large process to kill, which it specifically tries to 
do, because unmap_vmas() and free_pgtables() takes a very long time, 
sometimes tens of seconds.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-22  3:22                   ` David Rientjes
@ 2018-04-22  3:48                     ` Tetsuo Handa
  2018-04-22 13:08                       ` Michal Hocko
  2018-04-24  2:31                       ` David Rientjes
  0 siblings, 2 replies; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-22  3:48 UTC (permalink / raw)
  To: rientjes, mhocko; +Cc: akpm, aarcange, guro, linux-kernel, linux-mm

David Rientjes wrote:
> How have you tested this?
> 
> I'm wondering why you do not see oom killing of many processes if the 
> victim is a very large process that takes a long time to free memory in 
> exit_mmap() as I do because the oom reaper gives up trying to acquire 
> mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> 

We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
exit_mmap() holds mmap_sem for write. Then, at least memory which could
have been reclaimed if exit_mmap() did not hold mmap_sem for write will
be guaranteed to be reclaimed before MMF_OOM_SKIP is set.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-22  3:48                     ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa
@ 2018-04-22 13:08                       ` Michal Hocko
  2018-04-24  2:31                       ` David Rientjes
  1 sibling, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-22 13:08 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, akpm, aarcange, guro, linux-kernel, linux-mm

On Sun 22-04-18 12:48:13, Tetsuo Handa wrote:
> David Rientjes wrote:
> > How have you tested this?
> > 
> > I'm wondering why you do not see oom killing of many processes if the 
> > victim is a very large process that takes a long time to free memory in 
> > exit_mmap() as I do because the oom reaper gives up trying to acquire 
> > mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> > 
> 
> We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> exit_mmap() holds mmap_sem for write. Then, at least memory which could
> have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> be guaranteed to be reclaimed before MMF_OOM_SKIP is set.

That might be an alternative way but I am really wondering whether this
is the simplest way forward.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-22  3:45                 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
@ 2018-04-22 13:18                   ` Michal Hocko
  2018-04-23 16:09                     ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-22 13:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
	linux-kernel, linux-mm

On Sat 21-04-18 20:45:11, David Rientjes wrote:
> On Fri, 20 Apr 2018, Michal Hocko wrote:
> 
> > > The solution is certainly not to hold 
> > > down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead.
> > 
> > Why not? This is what we do for normal paths. exit path just tries to be
> > clever because it knows that it doesn't have to lock because there is no
> > concurent user. At least it wasn't until the oom reaper came. So I
> > really fail to see why we shouldn't do the most obvious thing and use
> > the locking.
> > 
> 
> Because the oom reaper uses the ability to acquire mm->mmap_sem as a 
> heuristic to determine when to give up and allow another process to be 
> chosen.
> 
> If the heuristics of the oom reaper are to be improved, that's great, but 
> this patch fixes the oops on powerpc as 4.17 material and as a stable 
> backport.  It's also well tested.
> 
> > > If 
> > > exit_mmap() is not making forward progress then that's a separate issue; 
> > 
> > Please read what I wrote. There is a page lock and there is no way to
> > guarantee it will make a forward progress. Or do you consider that not
> > true?
> > 
> 
> I don't have any evidence of it, and since this is called before 
> exit_mmap() sets MMF_OOM_SKIP then the oom reaper would need to set the 
> bit itself and I would be able to find the artifact it leaves behind in 
> the kernel log.  I cannot find a single instance of a thread stuck in 
> munlock by way of exit_mmap() that causes the oom reaper to have to set 
> the bit itself, and I should be able to if this were a problem.

Look. The fact that you do not _see any evidence_ is completely
irrelevant. The OOM reaper is about _guarantee_. And the guarantee is
gone with the page_lock because that is used in contexts which do
allocate memory and it can depend on other locks. So _no way_ we can
make MMF_OOM_SKIP to depend on it. I will not repeat it anymore. I will
not allow to ruin the whole oom reaper endeavor by adding "this should
not happen" stuff that the oom killer was full of.

> > > Holding down_write on 
> > > mm->mmap_sem otherwise needlessly over a large amount of code is riskier 
> > > (hasn't been done or tested here), more error prone (any code change over 
> > > this large area of code or in functions it calls are unnecessarily 
> > > burdened by unnecessary locking), makes exit_mmap() less extensible for 
> > > the same reason,
> > 
> > I do not see any of the calls in that path could suffer from holding
> > mmap_sem. Do you?
> > 
> > > and causes the oom reaper to give up and go set 
> > > MMF_OOM_SKIP itself because it depends on taking down_read while the 
> > > thread is still exiting.
> > 
> > Which is the standard backoff mechanism.
> > 
> 
> The reason today's locking methodology is preferred is because of the 
> backoff mechanism.  Your patch kills many processes unnecessarily if the 
> oom killer selects a large process to kill, which it specifically tries to 
> do, because unmap_vmas() and free_pgtables() takes a very long time, 
> sometimes tens of seconds.

and I absolutely agree that the feedback mechanism should be improved.
The patch I propose _might_ to lead to killing another task. I do not
pretend otherwise. But it will keep the lockup free guarantee which is
oom repeer giving us. Btw. the past oom implementation would simply kill
more in that case as well because exiting tasks with task->mm == NULL
would be ignored completely. So this is not a big regression even if
that happens occasionally.

Maybe invoking the reaper as suggested by Tetsuo will help here. Maybe
we will come up with something more smart. But I would like to have a
stop gap solution for stable that is easy enough. And your patch is not
doing that because it adds a very subtle dependency on the page lock.
So please stop repeating your arguments all over and either come with
an argument which proves me wrong and the lock_page dependency is not
real or come with an alternative solution which doesn't make
MMF_OOM_SKIP depend on the page lock.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-22 13:18                   ` Michal Hocko
@ 2018-04-23 16:09                     ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-23 16:09 UTC (permalink / raw)
  To: David Rientjes, Tetsuo Handa
  Cc: Andrew Morton, Andrea Arcangeli, Roman Gushchin, linux-kernel, linux-mm

On Sun 22-04-18 07:18:57, Michal Hocko wrote:
> On Sat 21-04-18 20:45:11, David Rientjes wrote:
[...]
> Maybe invoking the reaper as suggested by Tetsuo will help here. Maybe
> we will come up with something more smart. But I would like to have a
> stop gap solution for stable that is easy enough. And your patch is not
> doing that because it adds a very subtle dependency on the page lock.
> So please stop repeating your arguments all over and either come with
> an argument which proves me wrong and the lock_page dependency is not
> real or come with an alternative solution which doesn't make
> MMF_OOM_SKIP depend on the page lock.

I though I would give this a try but I am at a conference and quite
busy. Tetsuo are you willing to give it a try so that we have something
to compare and decide, please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-22  3:48                     ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa
  2018-04-22 13:08                       ` Michal Hocko
@ 2018-04-24  2:31                       ` David Rientjes
  2018-04-24  5:11                         ` Tetsuo Handa
  2018-04-24 13:04                         ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Michal Hocko
  1 sibling, 2 replies; 47+ messages in thread
From: David Rientjes @ 2018-04-24  2:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, Andrew Morton, Andrea Arcangeli, guro, linux-kernel, linux-mm

On Sun, 22 Apr 2018, Tetsuo Handa wrote:

> > I'm wondering why you do not see oom killing of many processes if the 
> > victim is a very large process that takes a long time to free memory in 
> > exit_mmap() as I do because the oom reaper gives up trying to acquire 
> > mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> > 
> 
> We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> exit_mmap() holds mmap_sem for write. Then, at least memory which could
> have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> 

I think that's an exceptionally good idea and will mitigate the concerns 
of others.

It can be done without holding mm->mmap_sem in exit_mmap() and uses the 
same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we 
don't get dozens of unnecessary oom kills.

What do you think about this?  It passes preliminary testing on powerpc 
and I'm enqueued it for much more intensive testing.  (I'm wishing there 
was a better way to acknowledge your contribution to fixing this issue, 
especially since you brought up the exact problem this is addressing in 
previous emails.)


mm, oom: fix concurrent munlock and oom reaper unmap

Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.

This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up deferencing a NULL ptl.

Fix this by manually freeing all possible memory from the mm before doing
the munlock and then setting MMF_OOM_SKIP.  The oom reaper can not run on
the mm anymore so the munlock is safe to do in exit_mmap().  It also
matches the logic that the oom reaper currently uses for determining when
to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom
killing.

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable@vger.kernel.org [4.14+]
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |  2 ++
 mm/mmap.c           | 39 ++++++++++++----------
 mm/oom_kill.c       | 81 ++++++++++++++++++++++++---------------------
 3 files changed, 66 insertions(+), 56 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
+void __oom_reap_task_mm(struct mm_struct *mm);
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,27 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Manually reap the mm to free as much memory as possible.
+		 * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
+		 * mm from further consideration.  Taking mm->mmap_sem for write
+		 * after setting MMF_OOM_SKIP will guarantee that the oom reaper
+		 * will not run on this mm again after mmap_sem is dropped.
+		 *
+		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * which clears VM_LOCKED, otherwise the oom reaper cannot
+		 * reliably test it.
+		 */
+		mutex_lock(&oom_lock);
+		__oom_reap_task_mm(mm);
+		mutex_unlock(&oom_lock);
+
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,24 +3057,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+void __oom_reap_task_mm(struct mm_struct *mm)
 {
-	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (!can_madv_dontneed_vma(vma))
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			const unsigned long start = vma->vm_start;
+			const unsigned long end = vma->vm_end;
+			struct mmu_gather tlb;
+
+			tlb_gather_mmu(&tlb, mm, start, end);
+			mmu_notifier_invalidate_range_start(mm, start, end);
+			unmap_page_range(&tlb, vma, start, end, NULL);
+			mmu_notifier_invalidate_range_end(mm, start, end);
+			tlb_finish_mmu(&tlb, start, end);
+		}
+	}
+}
+
+static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+{
 	bool ret = true;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
+	 * oom_reap_task_mm		exit_mm
 	 *   mmget_not_zero
 	 *				  mmput
 	 *				    atomic_dec_and_test
@@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
-	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
-			continue;
+	__oom_reap_task_mm(mm);
 
-		/*
-		 * Only anonymous pages have a good chance to be dropped
-		 * without additional steps which we cannot afford as we
-		 * are OOM already.
-		 *
-		 * We do not even care about fs backed pages because all
-		 * which are reclaimable have already been reclaimed and
-		 * we do not want to block exit_mmap by keeping mm ref
-		 * count elevated without a good reason.
-		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
-			const unsigned long start = vma->vm_start;
-			const unsigned long end = vma->vm_end;
-
-			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
-			unmap_page_range(&tlb, vma, start, end, NULL);
-			mmu_notifier_invalidate_range_end(mm, start, end);
-			tlb_finish_mmu(&tlb, start, end);
-		}
-	}
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES ||
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24  2:31                       ` David Rientjes
@ 2018-04-24  5:11                         ` Tetsuo Handa
  2018-04-24  5:35                           ` David Rientjes
  2018-04-24 13:04                         ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Michal Hocko
  1 sibling, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-24  5:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: mhocko, Andrew Morton, Andrea Arcangeli, guro, linux-kernel, linux-mm

> On Sun, 22 Apr 2018, Tetsuo Handa wrote:
> 
> > > I'm wondering why you do not see oom killing of many processes if the 
> > > victim is a very large process that takes a long time to free memory in 
> > > exit_mmap() as I do because the oom reaper gives up trying to acquire 
> > > mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> > > 
> > 
> > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > exit_mmap() holds mmap_sem for write. Then, at least memory which could
> > have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> > be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> > 
> 
> I think that's an exceptionally good idea and will mitigate the concerns 
> of others.
> 
> It can be done without holding mm->mmap_sem in exit_mmap() and uses the 
> same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we 
> don't get dozens of unnecessary oom kills.
> 
> What do you think about this?  It passes preliminary testing on powerpc 
> and I'm enqueued it for much more intensive testing.  (I'm wishing there 
> was a better way to acknowledge your contribution to fixing this issue, 
> especially since you brought up the exact problem this is addressing in 
> previous emails.)
> 

I don't think this patch is safe, for exit_mmap() is calling
mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock
held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap().
exit_mmap() must not block while holding oom_lock in order to guarantee that
oom_reap_task_mm() can give up.

Some suggestion on top of your patch:

 mm/mmap.c     | 13 +++++--------
 mm/oom_kill.c | 51 ++++++++++++++++++++++++++-------------------------
 2 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 981eed4..7b31357 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm)
 		/*
 		 * Manually reap the mm to free as much memory as possible.
 		 * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
-		 * mm from further consideration.  Taking mm->mmap_sem for write
-		 * after setting MMF_OOM_SKIP will guarantee that the oom reaper
-		 * will not run on this mm again after mmap_sem is dropped.
+		 * mm from further consideration. Setting MMF_OOM_SKIP under
+		 * oom_lock held will guarantee that the OOM reaper will not
+		 * run on this mm again.
 		 *
 		 * This needs to be done before calling munlock_vma_pages_all(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
 		 * reliably test it.
 		 */
-		mutex_lock(&oom_lock);
 		__oom_reap_task_mm(mm);
-		mutex_unlock(&oom_lock);
-
+		mutex_lock(&oom_lock);
 		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+		mutex_unlock(&oom_lock);
 	}
 
 	if (mm->locked_vm) {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..9a29df8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
+	mutex_lock(&oom_lock);
+
 	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
+	 * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * under oom_lock held.
 	 */
-	mutex_lock(&oom_lock);
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto unlock_oom;
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
-	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
-
 	trace_start_task_reaping(tsk->pid);
 
 	__oom_reap_task_mm(mm);
@@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk)
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 *
+	 * We have to make sure to not cause premature new oom victim selection:
+	 *
+	 * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
+	 *   mutex_trylock(&oom_lock)
+	 *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+	 *                               unmap_page_range() # frees some memory
+	 *                               set_bit(MMF_OOM_SKIP)
+	 *   out_of_memory()
+	 *     select_bad_process()
+	 *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+	 *   mutex_unlock(&oom_lock)
+	 *
+	 * Setting MMF_OOM_SKIP under oom_lock held will guarantee that the
+	 * last second alocation attempt is done by __alloc_pages_may_oom()
+	 * before out_of_memory() selects next OOM victim by finding
+	 * MMF_OOM_SKIP.
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24  5:11                         ` Tetsuo Handa
@ 2018-04-24  5:35                           ` David Rientjes
  2018-04-24 21:57                             ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Tetsuo Handa
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-24  5:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, Andrew Morton, Andrea Arcangeli, guro, linux-kernel, linux-mm

On Tue, 24 Apr 2018, Tetsuo Handa wrote:

> > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > > exit_mmap() holds mmap_sem for write. Then, at least memory which could
> > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> > > 
> > 
> > I think that's an exceptionally good idea and will mitigate the concerns 
> > of others.
> > 
> > It can be done without holding mm->mmap_sem in exit_mmap() and uses the 
> > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we 
> > don't get dozens of unnecessary oom kills.
> > 
> > What do you think about this?  It passes preliminary testing on powerpc 
> > and I'm enqueued it for much more intensive testing.  (I'm wishing there 
> > was a better way to acknowledge your contribution to fixing this issue, 
> > especially since you brought up the exact problem this is addressing in 
> > previous emails.)
> > 
> 
> I don't think this patch is safe, for exit_mmap() is calling
> mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock
> held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap().

One of the reasons that I extracted __oom_reap_task_mm() out of the new 
oom_reap_task_mm() is to avoid the checks that would be unnecessary when 
called from exit_mmap().  In this case, we can ignore the 
mm_has_blockable_invalidate_notifiers() check because exit_mmap() has 
already done mmu_notifier_release().  So I don't think there's a concern 
about __oom_reap_task_mm() blocking while holding oom_lock.  Unless you 
are referring to something else?

> exit_mmap() must not block while holding oom_lock in order to guarantee that
> oom_reap_task_mm() can give up.
> 
> Some suggestion on top of your patch:
> 
>  mm/mmap.c     | 13 +++++--------
>  mm/oom_kill.c | 51 ++++++++++++++++++++++++++-------------------------
>  2 files changed, 31 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 981eed4..7b31357 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm)
>  		/*
>  		 * Manually reap the mm to free as much memory as possible.
>  		 * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
> -		 * mm from further consideration.  Taking mm->mmap_sem for write
> -		 * after setting MMF_OOM_SKIP will guarantee that the oom reaper
> -		 * will not run on this mm again after mmap_sem is dropped.
> +		 * mm from further consideration. Setting MMF_OOM_SKIP under
> +		 * oom_lock held will guarantee that the OOM reaper will not
> +		 * run on this mm again.
>  		 *
>  		 * This needs to be done before calling munlock_vma_pages_all(),
>  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
>  		 * reliably test it.
>  		 */
> -		mutex_lock(&oom_lock);
>  		__oom_reap_task_mm(mm);
> -		mutex_unlock(&oom_lock);
> -
> +		mutex_lock(&oom_lock);
>  		set_bit(MMF_OOM_SKIP, &mm->flags);
> -		down_write(&mm->mmap_sem);
> -		up_write(&mm->mmap_sem);
> +		mutex_unlock(&oom_lock);
>  	}
>  
>  	if (mm->locked_vm) {
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8ba6cb8..9a29df8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	bool ret = true;
>  
> +	mutex_lock(&oom_lock);
> +
>  	/*
> -	 * We have to make sure to not race with the victim exit path
> -	 * and cause premature new oom victim selection:
> -	 * oom_reap_task_mm		exit_mm
> -	 *   mmget_not_zero
> -	 *				  mmput
> -	 *				    atomic_dec_and_test
> -	 *				  exit_oom_victim
> -	 *				[...]
> -	 *				out_of_memory
> -	 *				  select_bad_process
> -	 *				    # no TIF_MEMDIE task selects new victim
> -	 *  unmap_page_range # frees some memory
> +	 * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't
> +	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> +	 * under oom_lock held.
>  	 */
> -	mutex_lock(&oom_lock);
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +		goto unlock_oom;
>  
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
> @@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  		goto unlock_oom;
>  	}
>  
> -	/*
> -	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
> -	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> -	 * under mmap_sem for reading because it serializes against the
> -	 * down_write();up_write() cycle in exit_mmap().
> -	 */
> -	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> -		up_read(&mm->mmap_sem);
> -		trace_skip_task_reaping(tsk->pid);
> -		goto unlock_oom;
> -	}
> -
>  	trace_start_task_reaping(tsk->pid);
>  
>  	__oom_reap_task_mm(mm);
> @@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk)
>  	/*
>  	 * Hide this mm from OOM killer because it has been either reaped or
>  	 * somebody can't call up_write(mmap_sem).
> +	 *
> +	 * We have to make sure to not cause premature new oom victim selection:
> +	 *
> +	 * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
> +	 *   mutex_trylock(&oom_lock)
> +	 *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
> +	 *                               unmap_page_range() # frees some memory
> +	 *                               set_bit(MMF_OOM_SKIP)
> +	 *   out_of_memory()
> +	 *     select_bad_process()
> +	 *       test_bit(MMF_OOM_SKIP) # selects new oom victim
> +	 *   mutex_unlock(&oom_lock)
> +	 *
> +	 * Setting MMF_OOM_SKIP under oom_lock held will guarantee that the
> +	 * last second alocation attempt is done by __alloc_pages_may_oom()
> +	 * before out_of_memory() selects next OOM victim by finding
> +	 * MMF_OOM_SKIP.
>  	 */
> +	mutex_lock(&oom_lock);
>  	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	mutex_unlock(&oom_lock);
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24  2:31                       ` David Rientjes
  2018-04-24  5:11                         ` Tetsuo Handa
@ 2018-04-24 13:04                         ` Michal Hocko
  2018-04-24 20:01                           ` David Rientjes
  1 sibling, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-24 13:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tetsuo Handa, Andrew Morton, Andrea Arcangeli, guro,
	linux-kernel, linux-mm

On Mon 23-04-18 19:31:05, David Rientjes wrote:
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3015,6 +3015,27 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> +	if (unlikely(mm_is_oom_victim(mm))) {
> +		/*
> +		 * Manually reap the mm to free as much memory as possible.
> +		 * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
> +		 * mm from further consideration.  Taking mm->mmap_sem for write
> +		 * after setting MMF_OOM_SKIP will guarantee that the oom reaper
> +		 * will not run on this mm again after mmap_sem is dropped.
> +		 *
> +		 * This needs to be done before calling munlock_vma_pages_all(),
> +		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> +		 * reliably test it.
> +		 */
> +		mutex_lock(&oom_lock);
> +		__oom_reap_task_mm(mm);
> +		mutex_unlock(&oom_lock);
> +
> +		set_bit(MMF_OOM_SKIP, &mm->flags);
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
> +

Is there any reason why we cannot simply call __oom_reap_task_mm as we
have it now? mmap_sem for read shouldn't fail here because this is the
last reference of the mm and we are past the ksm and khugepaged
synchronizations. So unless my jed laged brain fools me the patch should
be as simple as the following (I haven't tested it at all).

diff --git a/mm/mmap.c b/mm/mmap.c
index faf85699f1a1..a8f170f53872 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3008,6 +3008,13 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	/*
+	 * The mm is not accessible for anybody except for the oom reaper
+	 * which cannot race with munlocking so reap the task direct.
+	 */
+	if (unlikely(mm_is_oom_victim(mm)))
+		__oom_reap_task_mm(current, mm);
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3030,23 +3037,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3060,6 +3050,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfd370526909..e39ceb127e8e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -524,7 +524,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
 	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
@@ -567,12 +567,14 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	pr_info("%s: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+			current->comm,
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	trace_finish_task_reaping(tsk->pid);
 unlock_oom:
@@ -590,10 +592,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
+	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto put_task;
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
@@ -609,6 +612,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
+put_task:
 	put_task_struct(tsk);
 }
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24 13:04                         ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Michal Hocko
@ 2018-04-24 20:01                           ` David Rientjes
  2018-04-24 20:13                             ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-24 20:01 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Tetsuo Handa, Andrea Arcangeli, guro, linux-kernel, linux-mm

On Tue, 24 Apr 2018, Michal Hocko wrote:

> Is there any reason why we cannot simply call __oom_reap_task_mm as we
> have it now? mmap_sem for read shouldn't fail here because this is the
> last reference of the mm and we are past the ksm and khugepaged
> synchronizations. So unless my jed laged brain fools me the patch should
> be as simple as the following (I haven't tested it at all).
> 

I wanted to remove all per task checks because they are now irrelevant: 
this would be the first dependency that exit_mmap() has on any 
task_struct, which isn't intuitive -- we simply want to exit the mmap.  
There's no requirement that current owns the mm other than this.  I wanted 
to avoid the implicit dependency on MMF_OOM_SKIP and make it explicit in 
the exit path to be matched with the oom reaper.  I didn't want anything 
additional printed to the kernel log about oom reaping unless the 
oom_reaper actually needed to intervene, which is useful knowledge outside 
of basic exiting.

My patch has passed intensive testing on both x86 and powerpc, so I'll ask 
that it's pushed for 4.17-rc3.  Many thanks to Tetsuo for the suggestion 
on calling __oom_reap_task_mm() from exit_mmap().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24 20:01                           ` David Rientjes
@ 2018-04-24 20:13                             ` Michal Hocko
  2018-04-24 20:22                               ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-24 20:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro,
	linux-kernel, linux-mm

On Tue 24-04-18 13:01:03, David Rientjes wrote:
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > Is there any reason why we cannot simply call __oom_reap_task_mm as we
> > have it now? mmap_sem for read shouldn't fail here because this is the
> > last reference of the mm and we are past the ksm and khugepaged
> > synchronizations. So unless my jed laged brain fools me the patch should
> > be as simple as the following (I haven't tested it at all).
> > 
> 
> I wanted to remove all per task checks because they are now irrelevant: 
> this would be the first dependency that exit_mmap() has on any 
> task_struct, which isn't intuitive -- we simply want to exit the mmap.  
> There's no requirement that current owns the mm other than this. 

There is no such requirement in the __oom_reap_task_mm. The given task
is used for reporting purposes.

> I wanted 
> to avoid the implicit dependency on MMF_OOM_SKIP and make it explicit in 
> the exit path to be matched with the oom reaper.

Well, I find it actually better that the code is not explicit about
MMF_OOM_SKIP. The whole thing happens in the oom proper which should be
really preferable. The whole synchronization is then completely
transparent to the oom (including the oom lock etc).

> I didn't want anything 
> additional printed to the kernel log about oom reaping unless the 
> oom_reaper actually needed to intervene, which is useful knowledge outside 
> of basic exiting.

Can we shave all those parts as follow ups and make the fix as simple as
possible?
 
> My patch has passed intensive testing on both x86 and powerpc, so I'll ask 
> that it's pushed for 4.17-rc3.  Many thanks to Tetsuo for the suggestion 
> on calling __oom_reap_task_mm() from exit_mmap().

Yeah, but your patch does have a problem with blockable mmu notifiers
IIUC.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24 20:13                             ` Michal Hocko
@ 2018-04-24 20:22                               ` David Rientjes
  2018-04-24 20:31                                 ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-24 20:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro,
	linux-kernel, linux-mm

On Tue, 24 Apr 2018, Michal Hocko wrote:

> > I wanted to remove all per task checks because they are now irrelevant: 
> > this would be the first dependency that exit_mmap() has on any 
> > task_struct, which isn't intuitive -- we simply want to exit the mmap.  
> > There's no requirement that current owns the mm other than this. 
> 
> There is no such requirement in the __oom_reap_task_mm. The given task
> is used for reporting purposes.
> 

And tracing, which is pointless.  And it unnecessarily spams the kernel 
log for basic exiting.

> > I wanted 
> > to avoid the implicit dependency on MMF_OOM_SKIP and make it explicit in 
> > the exit path to be matched with the oom reaper.
> 
> Well, I find it actually better that the code is not explicit about
> MMF_OOM_SKIP. The whole thing happens in the oom proper which should be
> really preferable. The whole synchronization is then completely
> transparent to the oom (including the oom lock etc).
> 

It's already done in exit_mmap().  I'm not changing 

> > I didn't want anything 
> > additional printed to the kernel log about oom reaping unless the 
> > oom_reaper actually needed to intervene, which is useful knowledge outside 
> > of basic exiting.
> 
> Can we shave all those parts as follow ups and make the fix as simple as
> possible?
>  

It is as simple as possible.  It is not doing any unnecessary locking or 
checks that the exit path does not need to do for the sake of a smaller 
patch.  The number of changed lines in the patch is not what I'm 
interested in, I am interested in something that is stable, something that 
works, doesn't add additional (and unnecessary locking), and doesn't 
change around what function sets what bit when called from what path.

> > My patch has passed intensive testing on both x86 and powerpc, so I'll ask 
> > that it's pushed for 4.17-rc3.  Many thanks to Tetsuo for the suggestion 
> > on calling __oom_reap_task_mm() from exit_mmap().
> 
> Yeah, but your patch does have a problem with blockable mmu notifiers
> IIUC.

What on earth are you talking about?  exit_mmap() does 
mmu_notifier_release().  There are no blockable mmu notifiers.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24 20:22                               ` David Rientjes
@ 2018-04-24 20:31                                 ` Michal Hocko
  2018-04-24 21:07                                   ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-24 20:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro,
	linux-kernel, linux-mm

On Tue 24-04-18 13:22:45, David Rientjes wrote:
[...]
> > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask 
> > > that it's pushed for 4.17-rc3.  Many thanks to Tetsuo for the suggestion 
> > > on calling __oom_reap_task_mm() from exit_mmap().
> > 
> > Yeah, but your patch does have a problem with blockable mmu notifiers
> > IIUC.
> 
> What on earth are you talking about?  exit_mmap() does 
> mmu_notifier_release().  There are no blockable mmu notifiers.

MMF_OOM_SKIP - remember? The thing that guarantees a forward progress.
So we cannot really depend on setting MMF_OOM_SKIP if a
mmu_notifier_release blocks for an excessive/unbounded amount of time.

Look I am not really interested in disussing this to death but it would
be really _nice_ if you could calm down a bit, stop fighting for the solution
you have proposed and ignore the feedback you are getting.

There are two things to care about here. Stop the race that can blow up
and do not regress MMF_OOM_SKIP guarantee. Can we please do that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24 20:31                                 ` Michal Hocko
@ 2018-04-24 21:07                                   ` David Rientjes
  2018-04-24 23:08                                     ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-24 21:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro,
	linux-kernel, linux-mm

On Tue, 24 Apr 2018, Michal Hocko wrote:

> > > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask 
> > > > that it's pushed for 4.17-rc3.  Many thanks to Tetsuo for the suggestion 
> > > > on calling __oom_reap_task_mm() from exit_mmap().
> > > 
> > > Yeah, but your patch does have a problem with blockable mmu notifiers
> > > IIUC.
> > 
> > What on earth are you talking about?  exit_mmap() does 
> > mmu_notifier_release().  There are no blockable mmu notifiers.
> 
> MMF_OOM_SKIP - remember? The thing that guarantees a forward progress.
> So we cannot really depend on setting MMF_OOM_SKIP if a
> mmu_notifier_release blocks for an excessive/unbounded amount of time.
> 

If the thread is blocked in exit_mmap() because of mmu_notifier_release() 
then the oom reaper will eventually grab mm->mmap_sem (nothing holding it 
in exit_mmap()), return true, and oom_reap_task() will set MMF_OOM_SKIP.  
This is unchanged with the patch and is a completely separate issue.

> Look I am not really interested in disussing this to death but it would
> be really _nice_ if you could calm down a bit, stop fighting for the solution
> you have proposed and ignore the feedback you are getting.
> 

I assume we should spend more time considering the two untested patches 
you have sent, one of which killed 17 processes while a 8GB memory hog was 
exiting because the oom reaper couldn't grab mm->mmap_sem and set 
MMF_OOM_SKIP.

> There are two things to care about here. Stop the race that can blow up
> and do not regress MMF_OOM_SKIP guarantee. Can we please do that.

My patch does both.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-24  5:35                           ` David Rientjes
@ 2018-04-24 21:57                             ` Tetsuo Handa
  2018-04-24 22:25                               ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Tetsuo Handa @ 2018-04-24 21:57 UTC (permalink / raw)
  To: rientjes; +Cc: mhocko, akpm, aarcange, guro, linux-kernel, linux-mm

David Rientjes wrote:
> On Tue, 24 Apr 2018, Tetsuo Handa wrote:
> 
> > > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > > > exit_mmap() holds mmap_sem for write. Then, at least memory which could
> > > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> > > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> > > > 
> > > 
> > > I think that's an exceptionally good idea and will mitigate the concerns 
> > > of others.
> > > 
> > > It can be done without holding mm->mmap_sem in exit_mmap() and uses the 
> > > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we 
> > > don't get dozens of unnecessary oom kills.
> > > 
> > > What do you think about this?  It passes preliminary testing on powerpc 
> > > and I'm enqueued it for much more intensive testing.  (I'm wishing there 
> > > was a better way to acknowledge your contribution to fixing this issue, 
> > > especially since you brought up the exact problem this is addressing in 
> > > previous emails.)
> > > 
> > 
> > I don't think this patch is safe, for exit_mmap() is calling
> > mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock
> > held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap().
> 
> One of the reasons that I extracted __oom_reap_task_mm() out of the new 
> oom_reap_task_mm() is to avoid the checks that would be unnecessary when 
> called from exit_mmap().  In this case, we can ignore the 
> mm_has_blockable_invalidate_notifiers() check because exit_mmap() has 
> already done mmu_notifier_release().  So I don't think there's a concern 
> about __oom_reap_task_mm() blocking while holding oom_lock.  Unless you 
> are referring to something else?

Oh, mmu_notifier_release() made mm_has_blockable_invalidate_notifiers() == false. OK.

But I want comments why it is safe; I will probably miss that dependency
when we move that code next time.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-24 21:57                             ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Tetsuo Handa
@ 2018-04-24 22:25                               ` David Rientjes
  2018-04-24 22:34                                 ` [patch v3 for-4.17] " David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-24 22:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mhocko, akpm, aarcange, guro, linux-kernel, linux-mm

On Wed, 25 Apr 2018, Tetsuo Handa wrote:

> > One of the reasons that I extracted __oom_reap_task_mm() out of the new 
> > oom_reap_task_mm() is to avoid the checks that would be unnecessary when 
> > called from exit_mmap().  In this case, we can ignore the 
> > mm_has_blockable_invalidate_notifiers() check because exit_mmap() has 
> > already done mmu_notifier_release().  So I don't think there's a concern 
> > about __oom_reap_task_mm() blocking while holding oom_lock.  Unless you 
> > are referring to something else?
> 
> Oh, mmu_notifier_release() made mm_has_blockable_invalidate_notifiers() == false. OK.
> 
> But I want comments why it is safe; I will probably miss that dependency
> when we move that code next time.
> 

Ok, makes sense.  I'll send a v3 to update the comment.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch v3 for-4.17] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-24 22:25                               ` David Rientjes
@ 2018-04-24 22:34                                 ` David Rientjes
  2018-04-24 23:19                                   ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2018-04-24 22:34 UTC (permalink / raw)
  To: Andrew Morton, Tetsuo Handa, Linus Torvalds
  Cc: mhocko, Andrea Arcangeli, guro, linux-kernel, linux-mm

Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.

This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel
oops.

Fix this by manually freeing all possible memory from the mm before doing
the munlock and then setting MMF_OOM_SKIP.  The oom reaper can not run on
the mm anymore so the munlock is safe to do in exit_mmap().  It also
matches the logic that the oom reaper currently uses for determining when
to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom
killing.

This issue fixes CVE-2018-1000200.

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable@vger.kernel.org [4.14+]
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |  2 ++
 mm/mmap.c           | 44 ++++++++++++++----------
 mm/oom_kill.c       | 81 ++++++++++++++++++++++++---------------------
 3 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
+void __oom_reap_task_mm(struct mm_struct *mm);
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,32 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Manually reap the mm to free as much memory as possible.
+		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
+		 * this mm from further consideration.  Taking mm->mmap_sem for
+		 * write after setting MMF_OOM_SKIP will guarantee that the oom
+		 * reaper will not run on this mm again after mmap_sem is
+		 * dropped.
+		 *
+		 * Nothing can be holding mm->mmap_sem here and the above call
+		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
+		 * __oom_reap_task_mm() will not block.
+		 *
+		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * which clears VM_LOCKED, otherwise the oom reaper cannot
+		 * reliably test it.
+		 */
+		mutex_lock(&oom_lock);
+		__oom_reap_task_mm(mm);
+		mutex_unlock(&oom_lock);
+
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,24 +3062,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+void __oom_reap_task_mm(struct mm_struct *mm)
 {
-	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (!can_madv_dontneed_vma(vma))
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			const unsigned long start = vma->vm_start;
+			const unsigned long end = vma->vm_end;
+			struct mmu_gather tlb;
+
+			tlb_gather_mmu(&tlb, mm, start, end);
+			mmu_notifier_invalidate_range_start(mm, start, end);
+			unmap_page_range(&tlb, vma, start, end, NULL);
+			mmu_notifier_invalidate_range_end(mm, start, end);
+			tlb_finish_mmu(&tlb, start, end);
+		}
+	}
+}
+
+static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+{
 	bool ret = true;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
+	 * oom_reap_task_mm		exit_mm
 	 *   mmget_not_zero
 	 *				  mmput
 	 *				    atomic_dec_and_test
@@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
-	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
-			continue;
+	__oom_reap_task_mm(mm);
 
-		/*
-		 * Only anonymous pages have a good chance to be dropped
-		 * without additional steps which we cannot afford as we
-		 * are OOM already.
-		 *
-		 * We do not even care about fs backed pages because all
-		 * which are reclaimable have already been reclaimed and
-		 * we do not want to block exit_mmap by keeping mm ref
-		 * count elevated without a good reason.
-		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
-			const unsigned long start = vma->vm_start;
-			const unsigned long end = vma->vm_end;
-
-			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
-			unmap_page_range(&tlb, vma, start, end, NULL);
-			mmu_notifier_invalidate_range_end(mm, start, end);
-			tlb_finish_mmu(&tlb, start, end);
-		}
-	}
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES ||
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24 21:07                                   ` David Rientjes
@ 2018-04-24 23:08                                     ` Michal Hocko
  2018-04-24 23:14                                       ` Michal Hocko
  0 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2018-04-24 23:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro,
	linux-kernel, linux-mm

On Tue 24-04-18 14:07:52, David Rientjes wrote:
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > > > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask 
> > > > > that it's pushed for 4.17-rc3.  Many thanks to Tetsuo for the suggestion 
> > > > > on calling __oom_reap_task_mm() from exit_mmap().
> > > > 
> > > > Yeah, but your patch does have a problem with blockable mmu notifiers
> > > > IIUC.
> > > 
> > > What on earth are you talking about?  exit_mmap() does 
> > > mmu_notifier_release().  There are no blockable mmu notifiers.
> > 
> > MMF_OOM_SKIP - remember? The thing that guarantees a forward progress.
> > So we cannot really depend on setting MMF_OOM_SKIP if a
> > mmu_notifier_release blocks for an excessive/unbounded amount of time.
> > 
> 
> If the thread is blocked in exit_mmap() because of mmu_notifier_release() 
> then the oom reaper will eventually grab mm->mmap_sem (nothing holding it 
> in exit_mmap()), return true, and oom_reap_task() will set MMF_OOM_SKIP.  
> This is unchanged with the patch and is a completely separate issue.

I must be missing something or we are talking past each other. So let me
be explicit. What does prevent the following

oom_reaper				exit_mmap
					  mutex_lock(oom_lock)
  mutex_lock(oom_lock)			    __oom_reap_task_mm
  					      mmu_notifier_invalidate_range_start
					        # blockable mmu_notifier
						# which takes ages to
						# finish or depends on
						# an allocation (in)directly
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap
  2018-04-24 23:08                                     ` Michal Hocko
@ 2018-04-24 23:14                                       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-24 23:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, guro,
	linux-kernel, linux-mm

On Tue 24-04-18 17:08:15, Michal Hocko wrote:
> On Tue 24-04-18 14:07:52, David Rientjes wrote:
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > > > > My patch has passed intensive testing on both x86 and powerpc, so I'll ask 
> > > > > > that it's pushed for 4.17-rc3.  Many thanks to Tetsuo for the suggestion 
> > > > > > on calling __oom_reap_task_mm() from exit_mmap().
> > > > > 
> > > > > Yeah, but your patch does have a problem with blockable mmu notifiers
> > > > > IIUC.
> > > > 
> > > > What on earth are you talking about?  exit_mmap() does 
> > > > mmu_notifier_release().  There are no blockable mmu notifiers.
> > > 
> > > MMF_OOM_SKIP - remember? The thing that guarantees a forward progress.
> > > So we cannot really depend on setting MMF_OOM_SKIP if a
> > > mmu_notifier_release blocks for an excessive/unbounded amount of time.
> > > 
> > 
> > If the thread is blocked in exit_mmap() because of mmu_notifier_release() 
> > then the oom reaper will eventually grab mm->mmap_sem (nothing holding it 
> > in exit_mmap()), return true, and oom_reap_task() will set MMF_OOM_SKIP.  
> > This is unchanged with the patch and is a completely separate issue.
> 
> I must be missing something or we are talking past each other.

Ohh. Ok, so _I_ was missing that mm_has_notifiers is false after
mmu_notifier_release. So we cannot block at that time. Then we are good.
Sorry about the confusion!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch v3 for-4.17] mm, oom: fix concurrent munlock and oom reaper unmap
  2018-04-24 22:34                                 ` [patch v3 for-4.17] " David Rientjes
@ 2018-04-24 23:19                                   ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2018-04-24 23:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Tetsuo Handa, Linus Torvalds, Andrea Arcangeli,
	guro, linux-kernel, linux-mm

On Tue 24-04-18 15:34:03, David Rientjes wrote:
> Since exit_mmap() is done without the protection of mm->mmap_sem, it is
> possible for the oom reaper to concurrently operate on an mm until
> MMF_OOM_SKIP is set.
> 
> This allows munlock_vma_pages_all() to concurrently run while the oom
> reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
> clearing VM_LOCKED from vm_flags before actually doing the munlock to
> determine if any other vmas are locking the same memory, the check for
> VM_LOCKED in the oom reaper is racy.
> 
> This is especially noticeable on architectures such as powerpc where
> clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
> is zapped by the oom reaper during follow_page_mask() after the check for
> pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel
> oops.
> 
> Fix this by manually freeing all possible memory from the mm before doing
> the munlock and then setting MMF_OOM_SKIP.  The oom reaper can not run on
> the mm anymore so the munlock is safe to do in exit_mmap().  It also
> matches the logic that the oom reaper currently uses for determining when
> to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom
> killing.
> 
> This issue fixes CVE-2018-1000200.
> 
> Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> Cc: stable@vger.kernel.org [4.14+]
> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: David Rientjes <rientjes@google.com>

OK, now that I finally got that mmu_notifier_release will make all
further mmu notifier calls NOOP then this is indeed safe. Considering
that you take oom_lock, you can set MMF_OOM_SKIP inside that lock and
won't have to bother with the exclusive mmap_sem AFAICS. So the patch
can be simplified.

But other than that this looks like a right way to go. I would have
preferred to hide the oom locking and MMF_OOM_SKIP handling in exit_mmap
but this is mostly cosmetic.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/oom.h |  2 ++
>  mm/mmap.c           | 44 ++++++++++++++----------
>  mm/oom_kill.c       | 81 ++++++++++++++++++++++++---------------------
>  3 files changed, 71 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> +void __oom_reap_task_mm(struct mm_struct *mm);
> +
>  extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
>  		unsigned long totalpages);
> diff --git a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3015,6 +3015,32 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> +	if (unlikely(mm_is_oom_victim(mm))) {
> +		/*
> +		 * Manually reap the mm to free as much memory as possible.
> +		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> +		 * this mm from further consideration.  Taking mm->mmap_sem for
> +		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> +		 * reaper will not run on this mm again after mmap_sem is
> +		 * dropped.
> +		 *
> +		 * Nothing can be holding mm->mmap_sem here and the above call
> +		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> +		 * __oom_reap_task_mm() will not block.
> +		 *
> +		 * This needs to be done before calling munlock_vma_pages_all(),
> +		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> +		 * reliably test it.
> +		 */
> +		mutex_lock(&oom_lock);
> +		__oom_reap_task_mm(mm);
> +		mutex_unlock(&oom_lock);
> +
> +		set_bit(MMF_OOM_SKIP, &mm->flags);
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
> +
>  	if (mm->locked_vm) {
>  		vma = mm->mmap;
>  		while (vma) {
> @@ -3036,24 +3062,6 @@ void exit_mmap(struct mm_struct *mm)
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
> -
> -	if (unlikely(mm_is_oom_victim(mm))) {
> -		/*
> -		 * Wait for oom_reap_task() to stop working on this
> -		 * mm. Because MMF_OOM_SKIP is already set before
> -		 * calling down_read(), oom_reap_task() will not run
> -		 * on this "mm" post up_write().
> -		 *
> -		 * mm_is_oom_victim() cannot be set from under us
> -		 * either because victim->mm is already set to NULL
> -		 * under task_lock before calling mmput and oom_mm is
> -		 * set not NULL by the OOM killer only if victim->mm
> -		 * is found not NULL while holding the task_lock.
> -		 */
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> -		down_write(&mm->mmap_sem);
> -		up_write(&mm->mmap_sem);
> -	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
>  	return false;
>  }
>  
> -
>  #ifdef CONFIG_MMU
>  /*
>   * OOM Reaper kernel thread which tries to reap the memory used by the OOM
> @@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> +void __oom_reap_task_mm(struct mm_struct *mm)
>  {
> -	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
> +
> +	/*
> +	 * Tell all users of get_user/copy_from_user etc... that the content
> +	 * is no longer stable. No barriers really needed because unmapping
> +	 * should imply barriers already and the reader would hit a page fault
> +	 * if it stumbled over a reaped memory.
> +	 */
> +	set_bit(MMF_UNSTABLE, &mm->flags);
> +
> +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> +		if (!can_madv_dontneed_vma(vma))
> +			continue;
> +
> +		/*
> +		 * Only anonymous pages have a good chance to be dropped
> +		 * without additional steps which we cannot afford as we
> +		 * are OOM already.
> +		 *
> +		 * We do not even care about fs backed pages because all
> +		 * which are reclaimable have already been reclaimed and
> +		 * we do not want to block exit_mmap by keeping mm ref
> +		 * count elevated without a good reason.
> +		 */
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> +			const unsigned long start = vma->vm_start;
> +			const unsigned long end = vma->vm_end;
> +			struct mmu_gather tlb;
> +
> +			tlb_gather_mmu(&tlb, mm, start, end);
> +			mmu_notifier_invalidate_range_start(mm, start, end);
> +			unmap_page_range(&tlb, vma, start, end, NULL);
> +			mmu_notifier_invalidate_range_end(mm, start, end);
> +			tlb_finish_mmu(&tlb, start, end);
> +		}
> +	}
> +}
> +
> +static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> +{
>  	bool ret = true;
>  
>  	/*
>  	 * We have to make sure to not race with the victim exit path
>  	 * and cause premature new oom victim selection:
> -	 * __oom_reap_task_mm		exit_mm
> +	 * oom_reap_task_mm		exit_mm
>  	 *   mmget_not_zero
>  	 *				  mmput
>  	 *				    atomic_dec_and_test
> @@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>  	trace_start_task_reaping(tsk->pid);
>  
> -	/*
> -	 * Tell all users of get_user/copy_from_user etc... that the content
> -	 * is no longer stable. No barriers really needed because unmapping
> -	 * should imply barriers already and the reader would hit a page fault
> -	 * if it stumbled over a reaped memory.
> -	 */
> -	set_bit(MMF_UNSTABLE, &mm->flags);
> -
> -	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> -		if (!can_madv_dontneed_vma(vma))
> -			continue;
> +	__oom_reap_task_mm(mm);
>  
> -		/*
> -		 * Only anonymous pages have a good chance to be dropped
> -		 * without additional steps which we cannot afford as we
> -		 * are OOM already.
> -		 *
> -		 * We do not even care about fs backed pages because all
> -		 * which are reclaimable have already been reclaimed and
> -		 * we do not want to block exit_mmap by keeping mm ref
> -		 * count elevated without a good reason.
> -		 */
> -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> -			const unsigned long start = vma->vm_start;
> -			const unsigned long end = vma->vm_end;
> -
> -			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> -			unmap_page_range(&tlb, vma, start, end, NULL);
> -			mmu_notifier_invalidate_range_end(mm, start, end);
> -			tlb_finish_mmu(&tlb, start, end);
> -		}
> -	}
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
> @@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk)
>  	struct mm_struct *mm = tsk->signal->oom_mm;
>  
>  	/* Retry the down_read_trylock(mmap_sem) a few times */
> -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> +	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
>  
>  	if (attempts <= MAX_OOM_REAP_RETRIES ||
>  	    test_bit(MMF_OOM_SKIP, &mm->flags))
>  		goto done;
>  
> -
>  	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  		task_pid_nr(tsk), tsk->comm);
>  	debug_show_all_locks();

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2018-04-24 23:19 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-17 22:46 [patch] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
2018-04-18  0:57 ` Tetsuo Handa
2018-04-18  2:39   ` David Rientjes
2018-04-18  2:52     ` [patch v2] " David Rientjes
2018-04-18  3:55       ` Tetsuo Handa
2018-04-18  4:11         ` David Rientjes
2018-04-18  4:47           ` Tetsuo Handa
2018-04-18  5:20             ` David Rientjes
2018-04-18  7:50       ` Michal Hocko
2018-04-18 11:49         ` Tetsuo Handa
2018-04-18 11:58           ` Michal Hocko
2018-04-18 13:25             ` Tetsuo Handa
2018-04-18 13:44               ` Michal Hocko
2018-04-18 14:28                 ` Tetsuo Handa
2018-04-18 19:14         ` David Rientjes
2018-04-19  6:35           ` Michal Hocko
2018-04-19 10:45             ` Tetsuo Handa
2018-04-19 11:04               ` Michal Hocko
2018-04-19 11:51                 ` Tetsuo Handa
2018-04-19 12:48                   ` Michal Hocko
2018-04-19 19:14               ` David Rientjes
2018-04-19 19:34             ` David Rientjes
2018-04-19 22:13               ` Tetsuo Handa
2018-04-20  8:23               ` Michal Hocko
2018-04-20 12:40                 ` Michal Hocko
2018-04-20 12:40                   ` Michal Hocko
2018-04-22  3:22                   ` David Rientjes
2018-04-22  3:48                     ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Tetsuo Handa
2018-04-22 13:08                       ` Michal Hocko
2018-04-24  2:31                       ` David Rientjes
2018-04-24  5:11                         ` Tetsuo Handa
2018-04-24  5:35                           ` David Rientjes
2018-04-24 21:57                             ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Tetsuo Handa
2018-04-24 22:25                               ` David Rientjes
2018-04-24 22:34                                 ` [patch v3 for-4.17] " David Rientjes
2018-04-24 23:19                                   ` Michal Hocko
2018-04-24 13:04                         ` [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap Michal Hocko
2018-04-24 20:01                           ` David Rientjes
2018-04-24 20:13                             ` Michal Hocko
2018-04-24 20:22                               ` David Rientjes
2018-04-24 20:31                                 ` Michal Hocko
2018-04-24 21:07                                   ` David Rientjes
2018-04-24 23:08                                     ` Michal Hocko
2018-04-24 23:14                                       ` Michal Hocko
2018-04-22  3:45                 ` [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap David Rientjes
2018-04-22 13:18                   ` Michal Hocko
2018-04-23 16:09                     ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.