linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: + mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch added to -mm tree
       [not found] <59936823.CQNWQErWJ8EAIG3q%akpm@linux-foundation.org>
@ 2017-08-16 13:23 ` Michal Hocko
  2017-08-17 17:12   ` Andrea Arcangeli
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2017-08-16 13:23 UTC (permalink / raw)
  To: akpm
  Cc: aarcange, hughd, kirill, oleg, penguin-kernel, rientjes,
	mm-commits, linux-mm

On Tue 15-08-17 14:31:15, Andrew Morton wrote:
[...]
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: mm: oom: let oom_reap_task and exit_mmap run concurrently
> 
> This is purely required because exit_aio() may block and exit_mmap() may
> never start, if the oom_reap_task cannot start running on a mm with
> mm_users == 0.
> 
> At the same time if the OOM reaper doesn't wait at all for the memory of
> the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
> generate a spurious OOM kill.
> 
> If it wasn't because of the exit_aio or similar blocking functions in the
> last mmput, it would be enough to change the oom_reap_task() in the case
> it finds mm_users == 0, to wait for a timeout or to wait for __mmput to
> set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so
> the concurrency of exit_mmap and oom_reap_task is apparently warranted.
> 
> It's a non standard runtime, exit_mmap() runs without mmap_sem, and
> oom_reap_task runs with the mmap_sem for reading as usual (kind of
> MADV_DONTNEED).
> 
> The race between the two is solved with a combination of
> tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized
> by a dummy down_write/up_write cycle on the same lines of the ksm_exit
> method).
> 
> If the oom_reap_task() may be running concurrently during exit_mmap,
> exit_mmap will wait it to finish in down_write (before taking down mm
> structures that would make the oom_reap_task fail with use after free).
> 
> If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP
> is already set and in turn all memory is already freed and furthermore the
> mm data structures may already have been taken down by free_pgtables.

I find the changelog rather hard to understand but that is not critical.

> Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
> Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: David Rientjes <rientjes@google.com>
> Tested-by: David Rientjes <rientjes@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Reviewed-by: Michal Hocko <mhocko@suse.com>

> ---
> 
>  kernel/fork.c |    1 -
>  mm/mmap.c     |   17 +++++++++++++++++
>  mm/oom_kill.c |   15 +++++----------
>  3 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff -puN kernel/fork.c~mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently kernel/fork.c
> --- a/kernel/fork.c~mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently
> +++ a/kernel/fork.c
> @@ -910,7 +910,6 @@ static inline void __mmput(struct mm_str
>  	}
>  	if (mm->binfmt)
>  		module_put(mm->binfmt->module);
> -	set_bit(MMF_OOM_SKIP, &mm->flags);
>  	mmdrop(mm);
>  }
>  
> diff -puN mm/mmap.c~mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently mm/mmap.c
> --- a/mm/mmap.c~mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently
> +++ a/mm/mmap.c
> @@ -3001,6 +3001,23 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (tsk_is_oom_victim(current)) {
> +		/*
> +		 * Wait for oom_reap_task() to stop working on this
> +		 * mm. Because MMF_OOM_SKIP is already set before
> +		 * calling down_read(), oom_reap_task() will not run
> +		 * on this "mm" post up_write().
> +		 *
> +		 * tsk_is_oom_victim() cannot be set from under us
> +		 * either because current->mm is already set to NULL
> +		 * under task_lock before calling mmput and oom_mm is
> +		 * set not NULL by the OOM killer only if current->mm
> +		 * is found not NULL while holding the task_lock.
> +		 */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> diff -puN mm/oom_kill.c~mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently mm/oom_kill.c
> --- a/mm/oom_kill.c~mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently
> +++ a/mm/oom_kill.c
> @@ -495,11 +495,12 @@ static bool __oom_reap_task_mm(struct ta
>  	}
>  
>  	/*
> -	 * increase mm_users only after we know we will reap something so
> -	 * that the mmput_async is called only when we have reaped something
> -	 * and delayed __mmput doesn't matter that much
> +	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
> +	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> +	 * under mmap_sem for reading because it serializes against the
> +	 * down_write();up_write() cycle in exit_mmap().
>  	 */
> -	if (!mmget_not_zero(mm)) {
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
>  		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
>  		goto unlock_oom;
> @@ -542,12 +543,6 @@ static bool __oom_reap_task_mm(struct ta
>  			K(get_mm_counter(mm, MM_SHMEMPAGES)));
>  	up_read(&mm->mmap_sem);
>  
> -	/*
> -	 * Drop our reference but make sure the mmput slow path is called from a
> -	 * different context because we shouldn't risk we get stuck there and
> -	 * put the oom_reaper out of the way.
> -	 */
> -	mmput_async(mm);
>  	trace_finish_task_reaping(tsk->pid);
>  unlock_oom:
>  	mutex_unlock(&oom_lock);
> _
> 
> Patches currently in -mm which might be from aarcange@redhat.com are
> 
> userfaultfd-selftest-exercise-uffdio_copy-zeropage-eexist.patch
> userfaultfd-selftest-explicit-failure-if-the-sigbus-test-failed.patch
> userfaultfd-call-userfaultfd_unmap_prep-only-if-__split_vma-succeeds.patch
> userfaultfd-provide-pid-in-userfault-msg-add-feat-union.patch
> mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch
> mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently-fix.patch
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: + mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch added to -mm tree
  2017-08-16 13:23 ` + mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch added to -mm tree Michal Hocko
@ 2017-08-17 17:12   ` Andrea Arcangeli
  2017-08-18  7:04     ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: Andrea Arcangeli @ 2017-08-17 17:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, hughd, kirill, oleg, penguin-kernel, rientjes, mm-commits,
	linux-mm

On Wed, Aug 16, 2017 at 03:23:29PM +0200, Michal Hocko wrote:
> Reviewed-by: Michal Hocko <mhocko@suse.com>

Thanks for the review!

There's this further possible microoptimization that can be folded on top.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: + mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch added to -mm tree
  2017-08-17 17:12   ` Andrea Arcangeli
@ 2017-08-18  7:04     ` Michal Hocko
  2017-08-18 18:41       ` Andrea Arcangeli
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2017-08-18  7:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, hughd, kirill, oleg, penguin-kernel, rientjes, mm-commits,
	linux-mm

On Thu 17-08-17 19:12:40, Andrea Arcangeli wrote:
> On Wed, Aug 16, 2017 at 03:23:29PM +0200, Michal Hocko wrote:
> > Reviewed-by: Michal Hocko <mhocko@suse.com>
> 
> Thanks for the review!
> 
> There's this further possible microoptimization that can be folded on top.
> 
> >From 76bf017f923581d15fe01249af92b0d757752a9f Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Thu, 17 Aug 2017 18:39:46 +0200
> Subject: [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap run
>  concurrently, add unlikely
> 
> Microoptimization to fold before merging.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/mmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 013170e5c8a4..ab0026a8acc4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3003,7 +3003,7 @@ void exit_mmap(struct mm_struct *mm)
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
>  	set_bit(MMF_OOM_SKIP, &mm->flags);
> -	if (tsk_is_oom_victim(current)) {
> +	if (unlikely(tsk_is_oom_victim(current))) {

I dunno. This doesn't make any difference in the generated code for
me (with gcc 6.4). If anything we might wan't to putt unlikely inside
tsk_is_oom_victim. Or even go further and use a jump label to get any
conditional paths out of way.

>  		/*
>  		 * Wait for oom_reap_task() to stop working on this
>  		 * mm. Because MMF_OOM_SKIP is already set before
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: + mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch added to -mm tree
  2017-08-18  7:04     ` Michal Hocko
@ 2017-08-18 18:41       ` Andrea Arcangeli
  2017-08-21  8:30         ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: Andrea Arcangeli @ 2017-08-18 18:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, hughd, kirill, oleg, penguin-kernel, rientjes, mm-commits,
	linux-mm

On Fri, Aug 18, 2017 at 09:04:44AM +0200, Michal Hocko wrote:
> I dunno. This doesn't make any difference in the generated code for
> me (with gcc 6.4). If anything we might wan't to putt unlikely inside

That's fine, this is just in case the code surrounding the check
changes in the future. It's not like we should remove unlikely/likely
if the emitted bytecode doesn't change.

> tsk_is_oom_victim. Or even go further and use a jump label to get any

I don't think it's necessarily the best to put it inside
tsk_is_oom_victim, even if currently it would be the same.

All it matters for likely unlikely is not to risk to ever get it
wrong. If unsure it's better to leave it alone.

We can't be sure all future callers of tsk_is_oom_victim will always
be unlikely to get a true retval. All we can be sure is that this
specific caller will get a false retval 100% of the time, in all
workloads where performance can matter.

> conditional paths out of way.

Using a jump label won't allocate memory so I tend to believe it would
be safe to run them here. However before worrying at the exit path, I
think the first target of optimization would be the MMF_UNSTABLE
checks, those are in the page fault fast paths and they end up run
infinitely more frequently than this single branch in exit.

I'm guilty of adding a branch to the page fault myself to check for
userfaultfd_missing(vma) (and such one is not even unlikely, as it
depends on the workload if it is), but without userfaultfd there are
things that just aren't possible with the standard
mmap/mprotect/SIGSEGV legacy API. So there's no way around it, unless
we run stop machine on s390 or 2*NR_CPUs IPIs on x86 every time
somebody calls UFFDIO_REGISTER/UNREGISTER which may introduce
unexpected latencies on large SMP systems (furthermore I made sure
CONFIG_USERFAULTFD=n would completely eliminate the branch at build
time, see the "return false" inline).

So what would you think about the simplest approach to the
MMF_UNSTABLE issue, that is to add a build time CONFIG_OOM_REAPER=y
option for the OOM reaper so those branches are optimized away at
build time (and the above one too, and perhaps the MMF_OOM_SKIP
set_bit too) if it's ok to disable the OOM reaper as well and increase
the risk an OOM hang? (it's years I didn't hit an OOM hang in my
desktop even before OOM reaper was introduced). It could be default
enabled of course.

I'd be curious to be able to still test what happens to the VM when
the OOM reaper is off, so if nothing else it would be a debug option,
because it'd also help to reproduce more easily those
filesystem-kernel-thread induced hangs that would still happen if the
OOM reaper cannot run because some other process is trying to take the
mmap_sem for writing. A down_read_trylock_unfair would go a long way
to reduce the likelyhood to run into that. The kernel CI exercising
multiple configs would then also autonomously CC us on a report if
those branches are a measurable issue so it'll be easier to tell if
the migration entry conversion or static key is worth it for
MMF_UNSTABLE.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: + mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch added to -mm tree
  2017-08-18 18:41       ` Andrea Arcangeli
@ 2017-08-21  8:30         ` Michal Hocko
  0 siblings, 0 replies; 5+ messages in thread
From: Michal Hocko @ 2017-08-21  8:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, hughd, kirill, oleg, penguin-kernel, rientjes, mm-commits,
	linux-mm

On Fri 18-08-17 20:41:45, Andrea Arcangeli wrote:
> On Fri, Aug 18, 2017 at 09:04:44AM +0200, Michal Hocko wrote:
> > I dunno. This doesn't make any difference in the generated code for
> > me (with gcc 6.4). If anything we might wan't to putt unlikely inside
> 
> That's fine, this is just in case the code surrounding the check
> changes in the future. It's not like we should remove unlikely/likely
> if the emitted bytecode doesn't change.
> 
> > tsk_is_oom_victim. Or even go further and use a jump label to get any
> 
> I don't think it's necessarily the best to put it inside
> tsk_is_oom_victim, even if currently it would be the same.
> 
> All it matters for likely unlikely is not to risk to ever get it
> wrong. If unsure it's better to leave it alone.
> 
> We can't be sure all future callers of tsk_is_oom_victim will always
> be unlikely to get a true retval. All we can be sure is that this
> specific caller will get a false retval 100% of the time, in all
> workloads where performance can matter.

Cosindering that it is highly unlikely to meet an OOM victim I would
consider unlikely as always applicable. Even if this is something in the
oom proper then it is a) a cold path so a misprediction doesn't matter
and b) even then it is highly unlikely to meet a victim because oom
victims should almost always be a minority.

> > conditional paths out of way.
> 
> Using a jump label won't allocate memory so I tend to believe it would
> be safe to run them here. However before worrying at the exit path, I
> think the first target of optimization would be the MMF_UNSTABLE
> checks, those are in the page fault fast paths and they end up run
> infinitely more frequently than this single branch in exit.

Yes that is true.

[...]
> So what would you think about the simplest approach to the
> MMF_UNSTABLE issue, that is to add a build time CONFIG_OOM_REAPER=y
> option for the OOM reaper so those branches are optimized away at
> build time (and the above one too, and perhaps the MMF_OOM_SKIP
> set_bit too) if it's ok to disable the OOM reaper as well and increase
> the risk an OOM hang? (it's years I didn't hit an OOM hang in my
> desktop even before OOM reaper was introduced). It could be default
> enabled of course.

I really do hate how many config options we have already and adding more
on top doesn't look like an improvement to me. Jump labels sound like
a much better way forward. Or do you see any potential disadvantage?

> I'd be curious to be able to still test what happens to the VM when
> the OOM reaper is off, so if nothing else it would be a debug option,
> because it'd also help to reproduce more easily those

The same could be achieved with a kernel command line option which would
be a smaller patch, easier to maintain in future and also wouldn't
further increase the config space fragmentation.

> filesystem-kernel-thread induced hangs that would still happen if the
> OOM reaper cannot run because some other process is trying to take the
> mmap_sem for writing. A down_read_trylock_unfair would go a long way
> to reduce the likelyhood to run into that. The kernel CI exercising
> multiple configs would then also autonomously CC us on a report if
> those branches are a measurable issue so it'll be easier to tell if
> the migration entry conversion or static key is worth it for
> MMF_UNSTABLE.

While this sounds like an interesting exercise I am not convinced it
justifies the new config option.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-08-21  8:30 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <59936823.CQNWQErWJ8EAIG3q%akpm@linux-foundation.org>
2017-08-16 13:23 ` + mm-oom-let-oom_reap_task-and-exit_mmap-to-run-concurrently.patch added to -mm tree Michal Hocko
2017-08-17 17:12   ` Andrea Arcangeli
2017-08-18  7:04     ` Michal Hocko
2017-08-18 18:41       ` Andrea Arcangeli
2017-08-21  8:30         ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).