All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@kernel.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
Date: Wed, 26 Jul 2017 07:45:33 +0200	[thread overview]
Message-ID: <20170726054533.GA960@dhcp22.suse.cz> (raw)
In-Reply-To: <20170725182619.GQ29716@redhat.com>

On Tue 25-07-17 20:26:19, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> > That problem is real though as reported by David.
> 
> I'm not against fixing it, I just think it's not a major concern, and
> the solution doesn't seem optimal as measured by Kirill.
> 
> I'm just skeptical it's the best to solve that tiny race, 99.9% of the
> time such down_write is unnecessary.
> 
> > it is not only about exit_mmap. __mmput calls into exit_aio and that can
> > wait for completion and there is no way to guarantee this will finish in
> > finite time.
> 
> exit_aio blocking is actually the only good point for wanting this
> concurrency where exit_mmap->unmap_vmas and
> oom_reap_task->unmap_page_range have to run concurrently on the same
> mm.

Yes, exit_aio is the only blocking call I know of currently. But I would
like this to be as robust as possible and so I do not want to rely on
the current implementation. This can change in future and I can
guarantee that nobody will think about the oom path when adding
something to the final __mmput path.

> exit_mmap would have no issue, if there was enough time in the
> lifetime CPU to allocate the memory, sure the memory will also be
> freed in finite amount of time by exit_mmap.

I am not sure I understand. Say that any call prior to unmap_vmas blocks
on a lock which is held by another call path which cannot proceed with
the allocation...
 
> In fact you mentioned multiple OOM in the NUMA case, exit_mmap may not
> solve that, so depending on the runtime it may have been better not to
> wait all memory of the process to be freed before moving to the next
> task, but only a couple of seconds before the OOM reaper moves to a
> new candidate. Again this is only a tradeoff between solving the OOM
> faster vs risk of false positives OOM.

I really do not want to rely on any timing. This just too fragile. Once
we have killed a task then we shouldn't pick another victim until it
passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
false positives while we have already disrupted the workload.
 
> If it wasn't because of exit_aio (which may have to wait I/O
> completion), changing the OOM reaper to return "false" if
> mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
> have been enough (and depending on the runtime it may have solved OOM
> faster in NUMA) and there would be absolutely no need to run OOM
> reaper and exit_mmap concurrently on the same mm. However there's such
> exit_aio..
> 
> Raw I/O mempools never require memory allocations, although aio if it
> involves a filesystem to complete may run into filesystem or buffering
> locks which are known to loop forever or depend on other tasks stuck
> in kernel allocations, so I didn't go down that chain too long.

Exactly. We simply cannot assume anything here because veryfying this
basically impossible.
 
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..615133762b99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		/* wait the OOM reaper to stop working on this mm */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..2a7000995784 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> +	bool mmgot = true;
>  
>  	/*
>  	 * We have to make sure to not race with the victim exit path
> @@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 * and delayed __mmput doesn't matter that much
>  	 */
>  	if (!mmget_not_zero(mm)) {
> -		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		goto unlock_oom;
> +		/*
> +		 * MMF_OOM_SKIP is set by exit_mmap when the OOM
> +		 * reaper can't work on the mm anymore.
> +		 */
> +		if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +			up_read(&mm->mmap_sem);
> +			goto unlock_oom;
> +		}
> +		mmgot = false;
>  	}

This will work more or less the same to what we have currently.

[victim]		[oom reaper]				[oom killer]
do_exit			__oom_reap_task_mm
  mmput
    __mmput
			  mmget_not_zero
			    test_and_set_bit(MMF_OOM_SKIP)
			    					oom_evaluate_task
								   # select next victim 
			  # reap the mm
      unmap_vmas

so we can select a next victim while the current one is still not
completely torn down.

> > > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > >    mmap_sem and then the arch code will release the mmap_sem despite
> > >    it was already released by handle_mm_fault? Anonymous memory faults
> > >    aren't common to return VM_FAULT_RETRY but an userfault
> > >    can. Shouldn't there be a block that prevents overwriting if
> > >    VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > > 
> > > 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > 		ret = VM_FAULT_SIGBUS;
> > 
> > I am not sure I understand what you mean and how this is related to the
> > patch?
> 
> It's not related to the patch but it involves the OOM reaper as it
> only happens when MMF_UNSTABLE is set which is set only by the OOM
> reaper. I was simply reading the OOM reaper code and following up what
> MMF_UNSTABLE does and it ringed a bell.

I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
reaped memory") will clarify this code. If not please start a new thread
so that we do not conflate different things together.

-- 
Michal Hocko
SUSE Labs

WARNING: multiple messages have this Message-ID (diff)
From: Michal Hocko <mhocko@kernel.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Oleg Nesterov <oleg@redhat.com>, Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
Date: Wed, 26 Jul 2017 07:45:33 +0200	[thread overview]
Message-ID: <20170726054533.GA960@dhcp22.suse.cz> (raw)
In-Reply-To: <20170725182619.GQ29716@redhat.com>

On Tue 25-07-17 20:26:19, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> > That problem is real though as reported by David.
> 
> I'm not against fixing it, I just think it's not a major concern, and
> the solution doesn't seem optimal as measured by Kirill.
> 
> I'm just skeptical it's the best to solve that tiny race, 99.9% of the
> time such down_write is unnecessary.
> 
> > it is not only about exit_mmap. __mmput calls into exit_aio and that can
> > wait for completion and there is no way to guarantee this will finish in
> > finite time.
> 
> exit_aio blocking is actually the only good point for wanting this
> concurrency where exit_mmap->unmap_vmas and
> oom_reap_task->unmap_page_range have to run concurrently on the same
> mm.

Yes, exit_aio is the only blocking call I know of currently. But I would
like this to be as robust as possible and so I do not want to rely on
the current implementation. This can change in future and I can
guarantee that nobody will think about the oom path when adding
something to the final __mmput path.

> exit_mmap would have no issue, if there was enough time in the
> lifetime CPU to allocate the memory, sure the memory will also be
> freed in finite amount of time by exit_mmap.

I am not sure I understand. Say that any call prior to unmap_vmas blocks
on a lock which is held by another call path which cannot proceed with
the allocation...
 
> In fact you mentioned multiple OOM in the NUMA case, exit_mmap may not
> solve that, so depending on the runtime it may have been better not to
> wait all memory of the process to be freed before moving to the next
> task, but only a couple of seconds before the OOM reaper moves to a
> new candidate. Again this is only a tradeoff between solving the OOM
> faster vs risk of false positives OOM.

I really do not want to rely on any timing. This just too fragile. Once
we have killed a task then we shouldn't pick another victim until it
passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
false positives while we have already disrupted the workload.
 
> If it wasn't because of exit_aio (which may have to wait I/O
> completion), changing the OOM reaper to return "false" if
> mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
> have been enough (and depending on the runtime it may have solved OOM
> faster in NUMA) and there would be absolutely no need to run OOM
> reaper and exit_mmap concurrently on the same mm. However there's such
> exit_aio..
> 
> Raw I/O mempools never require memory allocations, although aio if it
> involves a filesystem to complete may run into filesystem or buffering
> locks which are known to loop forever or depend on other tasks stuck
> in kernel allocations, so I didn't go down that chain too long.

Exactly. We simply cannot assume anything here because veryfying this
basically impossible.
 
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..615133762b99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		/* wait the OOM reaper to stop working on this mm */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..2a7000995784 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> +	bool mmgot = true;
>  
>  	/*
>  	 * We have to make sure to not race with the victim exit path
> @@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 * and delayed __mmput doesn't matter that much
>  	 */
>  	if (!mmget_not_zero(mm)) {
> -		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		goto unlock_oom;
> +		/*
> +		 * MMF_OOM_SKIP is set by exit_mmap when the OOM
> +		 * reaper can't work on the mm anymore.
> +		 */
> +		if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +			up_read(&mm->mmap_sem);
> +			goto unlock_oom;
> +		}
> +		mmgot = false;
>  	}

This will work more or less the same to what we have currently.

[victim]		[oom reaper]				[oom killer]
do_exit			__oom_reap_task_mm
  mmput
    __mmput
			  mmget_not_zero
			    test_and_set_bit(MMF_OOM_SKIP)
			    					oom_evaluate_task
								   # select next victim 
			  # reap the mm
      unmap_vmas

so we can select a next victim while the current one is still not
completely torn down.

> > > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > >    mmap_sem and then the arch code will release the mmap_sem despite
> > >    it was already released by handle_mm_fault? Anonymous memory faults
> > >    aren't common to return VM_FAULT_RETRY but an userfault
> > >    can. Shouldn't there be a block that prevents overwriting if
> > >    VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > > 
> > > 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > 		ret = VM_FAULT_SIGBUS;
> > 
> > I am not sure I understand what you mean and how this is related to the
> > patch?
> 
> It's not related to the patch but it involves the OOM reaper as it
> only happens when MMF_UNSTABLE is set which is set only by the OOM
> reaper. I was simply reading the OOM reaper code and following up what
> MMF_UNSTABLE does and it ringed a bell.

I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
reaped memory") will clarify this code. If not please start a new thread
so that we do not conflate different things together.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-07-26  5:45 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-24  7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
2017-07-24  7:23 ` Michal Hocko
2017-07-24 14:00 ` Kirill A. Shutemov
2017-07-24 14:00   ` Kirill A. Shutemov
2017-07-24 14:15   ` Michal Hocko
2017-07-24 14:15     ` Michal Hocko
2017-07-24 14:51     ` Kirill A. Shutemov
2017-07-24 14:51       ` Kirill A. Shutemov
2017-07-24 16:11       ` Michal Hocko
2017-07-24 16:11         ` Michal Hocko
2017-07-25 14:17         ` Kirill A. Shutemov
2017-07-25 14:17           ` Kirill A. Shutemov
2017-07-25 14:26           ` Michal Hocko
2017-07-25 14:26             ` Michal Hocko
2017-07-25 15:07             ` Kirill A. Shutemov
2017-07-25 15:07               ` Kirill A. Shutemov
2017-07-25 15:15               ` Michal Hocko
2017-07-25 15:15                 ` Michal Hocko
2017-07-25 14:26         ` Michal Hocko
2017-07-25 15:17           ` Kirill A. Shutemov
2017-07-25 15:17             ` Kirill A. Shutemov
2017-07-25 15:23             ` Michal Hocko
2017-07-25 15:23               ` Michal Hocko
2017-07-25 15:31               ` Kirill A. Shutemov
2017-07-25 15:31                 ` Kirill A. Shutemov
2017-07-25 16:04                 ` Michal Hocko
2017-07-25 16:04                   ` Michal Hocko
2017-07-25 19:19                   ` Andrea Arcangeli
2017-07-25 19:19                     ` Andrea Arcangeli
2017-07-26  5:45                     ` Michal Hocko
2017-07-26  5:45                       ` Michal Hocko
2017-07-26 16:29                       ` Andrea Arcangeli
2017-07-26 16:29                         ` Andrea Arcangeli
2017-07-26 16:43                         ` Andrea Arcangeli
2017-07-26 16:43                           ` Andrea Arcangeli
2017-07-27  6:50                         ` Michal Hocko
2017-07-27  6:50                           ` Michal Hocko
2017-07-27 14:55                           ` Andrea Arcangeli
2017-07-27 14:55                             ` Andrea Arcangeli
2017-07-28  6:23                             ` Michal Hocko
2017-07-28  6:23                               ` Michal Hocko
2017-07-28  1:58                         ` [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run kbuild test robot
2017-08-15  0:20                         ` [PATCH] mm, oom: allow oom reaper to race with exit_mmap David Rientjes
2017-08-15  0:20                           ` David Rientjes
2017-07-24 15:27 ` Michal Hocko
2017-07-24 15:27   ` Michal Hocko
2017-07-24 16:42 ` kbuild test robot
2017-07-24 18:12   ` Michal Hocko
2017-07-24 18:12     ` Michal Hocko
2017-07-25 15:26 ` Andrea Arcangeli
2017-07-25 15:26   ` Andrea Arcangeli
2017-07-25 15:45   ` Michal Hocko
2017-07-25 15:45     ` Michal Hocko
2017-07-25 18:26     ` Andrea Arcangeli
2017-07-25 18:26       ` Andrea Arcangeli
2017-07-26  5:45       ` Michal Hocko [this message]
2017-07-26  5:45         ` Michal Hocko
2017-07-26 16:39         ` Andrea Arcangeli
2017-07-26 16:39           ` Andrea Arcangeli
2017-07-27  6:32           ` Michal Hocko
2017-07-27  6:32             ` Michal Hocko
2017-07-27  8:29 Manish Jaggi
2017-07-27  9:24 ` Michal Hocko
2017-08-10  8:16 Michal Hocko
2017-08-10  8:16 ` Michal Hocko
2017-08-10 18:05 ` Andrea Arcangeli
2017-08-10 18:05   ` Andrea Arcangeli
2017-08-10 18:51   ` Michal Hocko
2017-08-10 18:51     ` Michal Hocko
2017-08-10 20:36     ` Michal Hocko
2017-08-10 20:36       ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170726054533.GA960@dhcp22.suse.cz \
    --to=mhocko@kernel.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=oleg@redhat.com \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.