linux-mm.kvack.org archive mirror
* [PATCH] mm, oom: allow oom reaper to race with exit_mmap
@ 2017-07-24  7:23 Michal Hocko
  2017-07-24 14:00 ` Kirill A. Shutemov
                   ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Michal Hocko @ 2017-07-24  7:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Tetsuo Handa, Oleg Nesterov, Hugh Dickins,
	Andrea Arcangeli, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

David has noticed that the oom killer might kill additional tasks while
the exiting oom victim hasn't terminated yet, because the oom_reaper marks
the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
down to 0. The race is as follows:

oom_reap_task				do_exit
					  exit_mm
  __oom_reap_task_mm
					    mmput
					      __mmput
    mmget_not_zero # fails
    						exit_mmap # frees memory
  set_bit(MMF_OOM_SKIP)

The victim is still visible to the OOM killer until it is unhashed.

Currently we try to reduce the risk of this race by taking oom_lock
and having out_of_memory sleep while holding the lock, to give the
victim some time to exit. This is a quite suboptimal approach because
there is no guarantee the victim (especially a large one) will manage
to unmap its address space and free enough memory in the particular oom
domain which needs the memory (e.g. a specific NUMA node).

Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
to run in parallel with other unmappers (hence the mmap_sem for read).

The only tricky part is to exclude the page table tear down and all
operations which modify the address space in the __mmput path. exit_mmap
doesn't expect any other users, so it doesn't use any locking. Nothing
really forbids us from using mmap_sem for write there, though. In fact we
are already relying on this lock earlier in the __mmput path to
synchronize with ksm and khugepaged.

Take the exclusive mmap_sem when calling free_pgtables and destroying
vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
other operations can safely race with the parallel unmap.

Changes:
- bail out on NULL mm->mmap early, as per David Rientjes

Reported-by: David Rientjes <rientjes@google.com>
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
I previously sent this as an RFC [1] and it seems that the original
issue has been resolved [2], although an explicit Tested-by would of
course be appreciated. Hugh has pointed out [3] that using mmap_sem
in exit_mmap will allow us to drop the tricky
	down_write(mmap_sem);
	up_write(mmap_sem);
in both paths. I hope I will get to that in the foreseeable future.

I am not yet sure this is important enough to merge into stable trees;
I would rather wait for a report to show up.

[1] http://lkml.kernel.org/r/20170626130346.26314-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/alpine.DEB.2.10.1707111336250.60183@chino.kir.corp.google.com
[3] http://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

 mm/mmap.c     |  7 +++++++
 mm/oom_kill.c | 45 +++++++--------------------------------------
 2 files changed, 14 insertions(+), 38 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 24e9261bdcc0..0eeb658caa30 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables or unmap VMAs under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3005,7 +3010,9 @@ void exit_mmap(struct mm_struct *mm)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
+	up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..a6dabe3691c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,40 +470,15 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		return false;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
+	/* There is nothing to reap so bail out without signs in the log */
+	if (!mm->mmap)
+		goto unlock;
 
 	trace_start_task_reaping(tsk->pid);
 
@@ -540,17 +515,11 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-	up_read(&mm->mmap_sem);
 
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
-	mutex_unlock(&oom_lock);
+unlock:
+	up_read(&mm->mmap_sem);
+
 	return ret;
 }
 
-- 
2.13.2


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24  7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
@ 2017-07-24 14:00 ` Kirill A. Shutemov
  2017-07-24 14:15   ` Michal Hocko
  2017-07-24 15:27 ` Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2017-07-24 14:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML, Michal Hocko

On Mon, Jul 24, 2017 at 09:23:32AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> David has noticed that the oom killer might kill additional tasks while
> the exiting oom victim hasn't terminated yet, because the oom_reaper marks
> the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
> down to 0. The race is as follows:
> 
> oom_reap_task				do_exit
> 					  exit_mm
>   __oom_reap_task_mm
> 					    mmput
> 					      __mmput
>     mmget_not_zero # fails
>     						exit_mmap # frees memory
>   set_bit(MMF_OOM_SKIP)
> 
> The victim is still visible to the OOM killer until it is unhashed.
> 
> Currently we try to reduce the risk of this race by taking oom_lock
> and having out_of_memory sleep while holding the lock, to give the
> victim some time to exit. This is a quite suboptimal approach because
> there is no guarantee the victim (especially a large one) will manage
> to unmap its address space and free enough memory in the particular oom
> domain which needs the memory (e.g. a specific NUMA node).
> 
> Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> to run in parallel with other unmappers (hence the mmap_sem for read).
> 
> The only tricky part is to exclude the page table tear down and all
> operations which modify the address space in the __mmput path. exit_mmap
> doesn't expect any other users, so it doesn't use any locking. Nothing
> really forbids us from using mmap_sem for write there, though. In fact we
> are already relying on this lock earlier in the __mmput path to
> synchronize with ksm and khugepaged.

That's true, but we take mmap_sem there only for a small portion of cases.

It's quite different from taking the lock unconditionally. I'm worried
about the scalability implications of such a move. On bigger machines it
can be a big hit.

Should we do a performance/scalability evaluation of the patch before
getting it applied?

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24 14:00 ` Kirill A. Shutemov
@ 2017-07-24 14:15   ` Michal Hocko
  2017-07-24 14:51     ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-24 14:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Mon 24-07-17 17:00:08, Kirill A. Shutemov wrote:
> On Mon, Jul 24, 2017 at 09:23:32AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > David has noticed that the oom killer might kill additional tasks while
> > the exiting oom victim hasn't terminated yet, because the oom_reaper marks
> > the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
> > down to 0. The race is as follows:
> > 
> > oom_reap_task				do_exit
> > 					  exit_mm
> >   __oom_reap_task_mm
> > 					    mmput
> > 					      __mmput
> >     mmget_not_zero # fails
> >     						exit_mmap # frees memory
> >   set_bit(MMF_OOM_SKIP)
> > 
> > The victim is still visible to the OOM killer until it is unhashed.
> > 
> > Currently we try to reduce the risk of this race by taking oom_lock
> > and having out_of_memory sleep while holding the lock, to give the
> > victim some time to exit. This is a quite suboptimal approach because
> > there is no guarantee the victim (especially a large one) will manage
> > to unmap its address space and free enough memory in the particular oom
> > domain which needs the memory (e.g. a specific NUMA node).
> > 
> > Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > to run in parallel with other unmappers (hence the mmap_sem for read).
> > 
> > The only tricky part is to exclude the page table tear down and all
> > operations which modify the address space in the __mmput path. exit_mmap
> > doesn't expect any other users, so it doesn't use any locking. Nothing
> > really forbids us from using mmap_sem for write there, though. In fact we
> > are already relying on this lock earlier in the __mmput path to
> > synchronize with ksm and khugepaged.
> 
> That's true, but we take mmap_sem there only for a small portion of cases.
> 
> It's quite different from taking the lock unconditionally. I'm worried
> about the scalability implications of such a move. On bigger machines it
> can be a big hit.

What kind of scalability implications do you have in mind? There is
basically zero contention on the mmap_sem that late in the exit path,
so this should pretty much be the fast path of the down_write. I agree
it is not zero cost, but the cost of the address space freeing should
basically make it noise.

> Should we do a performance/scalability evaluation of the patch before
> getting it applied?

What kind of test(s) would you be interested in?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24 14:15   ` Michal Hocko
@ 2017-07-24 14:51     ` Kirill A. Shutemov
  2017-07-24 16:11       ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2017-07-24 14:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Mon, Jul 24, 2017 at 04:15:26PM +0200, Michal Hocko wrote:
> On Mon 24-07-17 17:00:08, Kirill A. Shutemov wrote:
> > On Mon, Jul 24, 2017 at 09:23:32AM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > David has noticed that the oom killer might kill additional tasks while
> > > the exiting oom victim hasn't terminated yet, because the oom_reaper marks
> > > the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
> > > down to 0. The race is as follows:
> > > 
> > > oom_reap_task				do_exit
> > > 					  exit_mm
> > >   __oom_reap_task_mm
> > > 					    mmput
> > > 					      __mmput
> > >     mmget_not_zero # fails
> > >     						exit_mmap # frees memory
> > >   set_bit(MMF_OOM_SKIP)
> > > 
> > > The victim is still visible to the OOM killer until it is unhashed.
> > > 
> > > Currently we try to reduce the risk of this race by taking oom_lock
> > > and having out_of_memory sleep while holding the lock, to give the
> > > victim some time to exit. This is a quite suboptimal approach because
> > > there is no guarantee the victim (especially a large one) will manage
> > > to unmap its address space and free enough memory in the particular oom
> > > domain which needs the memory (e.g. a specific NUMA node).
> > > 
> > > Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
> > > race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
> > > to run in parallel with other unmappers (hence the mmap_sem for read).
> > > 
> > > The only tricky part is to exclude the page table tear down and all
> > > operations which modify the address space in the __mmput path. exit_mmap
> > > doesn't expect any other users, so it doesn't use any locking. Nothing
> > > really forbids us from using mmap_sem for write there, though. In fact we
> > > are already relying on this lock earlier in the __mmput path to
> > > synchronize with ksm and khugepaged.
> > 
> > That's true, but we take mmap_sem there only for a small portion of cases.
> > 
> > It's quite different from taking the lock unconditionally. I'm worried
> > about the scalability implications of such a move. On bigger machines it
> > can be a big hit.
> 
> What kind of scalability implications do you have in mind? There is
> basically zero contention on the mmap_sem that late in the exit path,
> so this should pretty much be the fast path of the down_write. I agree
> it is not zero cost, but the cost of the address space freeing should
> basically make it noise.

Even in the fast path case, it adds two atomic operations per process.
If the cache line is not exclusive to the core by the time of exit(2),
it can be noticeable.

... but I guess it's not a very hot scenario.

I guess I'm just being too cautious here. :)
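
(For illustration, the two atomic operations above are the uncontended
writer fast path of the rwsem. A rough sketch only, with made-up names
(WRITER_BIAS, slowpath, wake_waiters); the actual kernel rwsem code
differs in detail:

	static void down_write_sketch(struct rw_semaphore *sem)
	{
		/* atomic op #1: try to grab the lock uncontended */
		if (atomic_long_cmpxchg(&sem->count, 0, WRITER_BIAS) != 0)
			slowpath(sem);		/* somebody holds the lock */
	}

	static void up_write_sketch(struct rw_semaphore *sem)
	{
		/* atomic op #2: release and wake anybody who queued up */
		if (atomic_long_sub_return(WRITER_BIAS, &sem->count) != 0)
			wake_waiters(sem);
	}

plus a possible cache line transfer if another core touched the
mmap_sem last.)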

> > Should we do a performance/scalability evaluation of the patch before
> > getting it applied?
> 
> What kind of test(s) would you be interested in?

Can we at least check that the number of /bin/true we can spawn per second
wouldn't be harmed by the patch? ;)

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24  7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
  2017-07-24 14:00 ` Kirill A. Shutemov
@ 2017-07-24 15:27 ` Michal Hocko
  2017-07-24 16:42 ` kbuild test robot
  2017-07-25 15:26 ` Andrea Arcangeli
  3 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2017-07-24 15:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Tetsuo Handa, Oleg Nesterov, Hugh Dickins,
	Andrea Arcangeli, linux-mm, LKML

Dohh, the full conflict resolution didn't make it into the commit. The
full patch is below
---


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24 14:51     ` Kirill A. Shutemov
@ 2017-07-24 16:11       ` Michal Hocko
  2017-07-25 14:17         ` Kirill A. Shutemov
  2017-07-25 14:26         ` Michal Hocko
  0 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2017-07-24 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Mon 24-07-17 17:51:42, Kirill A. Shutemov wrote:
> On Mon, Jul 24, 2017 at 04:15:26PM +0200, Michal Hocko wrote:
[...]
> > What kind of scalability implications do you have in mind? There is
> > basically zero contention on the mmap_sem that late in the exit path,
> > so this should pretty much be the fast path of the down_write. I agree
> > it is not zero cost, but the cost of the address space freeing should
> > basically make it noise.
> 
> Even in the fast path case, it adds two atomic operations per process.
> If the cache line is not exclusive to the core by the time of exit(2),
> it can be noticeable.
> 
> ... but I guess it's not a very hot scenario.
> 
> I guess I'm just being too cautious here. :)

I definitely did not want to handwave your concern. I just think we can
rule out the slow path and didn't think about the fast path overhead.

> > > Should we do a performance/scalability evaluation of the patch before
> > > getting it applied?
> > 
> > What kind of test(s) would you be interested in?
> 
> Can we at least check that the number of /bin/true we can spawn per second
> wouldn't be harmed by the patch? ;)

OK, so measuring a single /bin/true doesn't tell us anything, so I've done
root@test1:~# cat a.sh 
#!/bin/sh

NR=$1
for i in $(seq $NR)
do
        /bin/true
done

in my virtual machine (on an otherwise idle host) with 4 CPUs and 2GB of
RAM

Unpatched kernel
root@test1:~# /usr/bin/time -v ./a.sh 100000 
        Command being timed: "./a.sh 100000"
        User time (seconds): 53.57
        System time (seconds): 26.12
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.46
root@test1:~# /usr/bin/time -v ./a.sh 100000 
        Command being timed: "./a.sh 100000"
        User time (seconds): 53.90
        System time (seconds): 26.23
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.77
root@test1:~# /usr/bin/time -v ./a.sh 100000 
        Command being timed: "./a.sh 100000"
        User time (seconds): 54.02
        System time (seconds): 26.18
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.92

patched kernel
root@test1:~# /usr/bin/time -v ./a.sh 100000 
        Command being timed: "./a.sh 100000"
        User time (seconds): 53.81
        System time (seconds): 26.55
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.99
root@test1:~# /usr/bin/time -v ./a.sh 100000 
        Command being timed: "./a.sh 100000"
        User time (seconds): 53.78
        System time (seconds): 26.15
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.67
root@test1:~# /usr/bin/time -v ./a.sh 100000 
        Command being timed: "./a.sh 100000"
        User time (seconds): 54.08
        System time (seconds): 26.87
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:20.52

the results vary quite a lot (have a look at the user time, which
should have no reason to vary at all - maybe a virtual machine
artifact?). I would say that we are still reasonably close to noise
here. Considering that /bin/true would be close to the worst case, I
think this looks reasonable. What do you think?

If you absolutely insist, I can make the lock conditional only for oom
victims. That would still mean current->signal->oom_mm pointer fetches
and 2 branches.
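
A sketch of what that conditional variant could look like (hypothetical,
just to illustrate the cost; the helper below is a made-up name and this
is not what the posted patch does):

	/* hypothetical: does this mm belong to the current oom victim? */
	static inline bool mm_is_oom_victim(struct mm_struct *mm)
	{
		return current->signal->oom_mm == mm; /* 2 pointer fetches */
	}

	void exit_mmap(struct mm_struct *mm)
	{
		bool oom = mm_is_oom_victim(mm);	/* plus 2 branches below */

		/* ... unmap_vmas() etc. as before ... */
		if (oom)
			down_write(&mm->mmap_sem);
		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
		tlb_finish_mmu(&tlb, 0, -1);
		/* ... remove_vma() loop as before ... */
		if (oom)
			up_write(&mm->mmap_sem);
	}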
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24  7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
  2017-07-24 14:00 ` Kirill A. Shutemov
  2017-07-24 15:27 ` Michal Hocko
@ 2017-07-24 16:42 ` kbuild test robot
  2017-07-24 18:12   ` Michal Hocko
  2017-07-25 15:26 ` Andrea Arcangeli
  3 siblings, 1 reply; 36+ messages in thread
From: kbuild test robot @ 2017-07-24 16:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild-all, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, Andrea Arcangeli, linux-mm, LKML,
	Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 5793 bytes --]

Hi Michal,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.13-rc2 next-20170724]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-oom-allow-oom-reaper-to-race-with-exit_mmap/20170724-233159
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x016-201730 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   mm/oom_kill.c: In function '__oom_reap_task_mm':
>> mm/oom_kill.c:523:9: error: 'ret' undeclared (first use in this function)
     return ret;
            ^~~
   mm/oom_kill.c:523:9: note: each undeclared identifier is reported only once for each function it appears in
>> mm/oom_kill.c:524:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^

vim +/ret +523 mm/oom_kill.c

03049269d Michal Hocko       2016-03-25  468  
7ebffa455 Tetsuo Handa       2016-10-07  469  static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
aac453635 Michal Hocko       2016-03-25  470  {
aac453635 Michal Hocko       2016-03-25  471  	struct mmu_gather tlb;
aac453635 Michal Hocko       2016-03-25  472  	struct vm_area_struct *vma;
e2fe14564 Michal Hocko       2016-05-27  473  
aac453635 Michal Hocko       2016-03-25  474  	if (!down_read_trylock(&mm->mmap_sem)) {
fa60da35c Andrew Morton      2017-07-14  475  		trace_skip_task_reaping(tsk->pid);
ed7a155c6 Michal Hocko       2017-07-24  476  		return false;
e5e3f4c4f Michal Hocko       2016-07-26  477  	}
e5e3f4c4f Michal Hocko       2016-07-26  478  
ed7a155c6 Michal Hocko       2017-07-24  479  	/* There is nothing to reap so bail out without signs in the log */
ed7a155c6 Michal Hocko       2017-07-24  480  	if (!mm->mmap)
ed7a155c6 Michal Hocko       2017-07-24  481  		goto unlock;
aac453635 Michal Hocko       2016-03-25  482  
fa60da35c Andrew Morton      2017-07-14  483  	trace_start_task_reaping(tsk->pid);
fa60da35c Andrew Morton      2017-07-14  484  
3f70dc38c Michal Hocko       2016-10-07  485  	/*
3f70dc38c Michal Hocko       2016-10-07  486  	 * Tell all users of get_user/copy_from_user etc... that the content
3f70dc38c Michal Hocko       2016-10-07  487  	 * is no longer stable. No barriers really needed because unmapping
3f70dc38c Michal Hocko       2016-10-07  488  	 * should imply barriers already and the reader would hit a page fault
3f70dc38c Michal Hocko       2016-10-07  489  	 * if it stumbled over a reaped memory.
3f70dc38c Michal Hocko       2016-10-07  490  	 */
3f70dc38c Michal Hocko       2016-10-07  491  	set_bit(MMF_UNSTABLE, &mm->flags);
3f70dc38c Michal Hocko       2016-10-07  492  
aac453635 Michal Hocko       2016-03-25  493  	tlb_gather_mmu(&tlb, mm, 0, -1);
aac453635 Michal Hocko       2016-03-25  494  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
235190738 Kirill A. Shutemov 2017-02-22  495  		if (!can_madv_dontneed_vma(vma))
aac453635 Michal Hocko       2016-03-25  496  			continue;
aac453635 Michal Hocko       2016-03-25  497  
aac453635 Michal Hocko       2016-03-25  498  		/*
aac453635 Michal Hocko       2016-03-25  499  		 * Only anonymous pages have a good chance to be dropped
aac453635 Michal Hocko       2016-03-25  500  		 * without additional steps which we cannot afford as we
aac453635 Michal Hocko       2016-03-25  501  		 * are OOM already.
aac453635 Michal Hocko       2016-03-25  502  		 *
aac453635 Michal Hocko       2016-03-25  503  		 * We do not even care about fs backed pages because all
aac453635 Michal Hocko       2016-03-25  504  		 * which are reclaimable have already been reclaimed and
aac453635 Michal Hocko       2016-03-25  505  		 * we do not want to block exit_mmap by keeping mm ref
aac453635 Michal Hocko       2016-03-25  506  		 * count elevated without a good reason.
aac453635 Michal Hocko       2016-03-25  507  		 */
aac453635 Michal Hocko       2016-03-25  508  		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
aac453635 Michal Hocko       2016-03-25  509  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
3e8715fdc Kirill A. Shutemov 2017-02-22  510  					 NULL);
aac453635 Michal Hocko       2016-03-25  511  	}
aac453635 Michal Hocko       2016-03-25  512  	tlb_finish_mmu(&tlb, 0, -1);
bc448e897 Michal Hocko       2016-03-25  513  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
bc448e897 Michal Hocko       2016-03-25  514  			task_pid_nr(tsk), tsk->comm,
bc448e897 Michal Hocko       2016-03-25  515  			K(get_mm_counter(mm, MM_ANONPAGES)),
bc448e897 Michal Hocko       2016-03-25  516  			K(get_mm_counter(mm, MM_FILEPAGES)),
bc448e897 Michal Hocko       2016-03-25  517  			K(get_mm_counter(mm, MM_SHMEMPAGES)));
36324a990 Michal Hocko       2016-03-25  518  
fa60da35c Andrew Morton      2017-07-14  519  	trace_finish_task_reaping(tsk->pid);
ed7a155c6 Michal Hocko       2017-07-24  520  unlock:
ed7a155c6 Michal Hocko       2017-07-24  521  	up_read(&mm->mmap_sem);
ed7a155c6 Michal Hocko       2017-07-24  522  
aac453635 Michal Hocko       2016-03-25 @523  	return ret;
aac453635 Michal Hocko       2016-03-25 @524  }
aac453635 Michal Hocko       2016-03-25  525  

:::::: The code at line 523 was first introduced by commit
:::::: aac453635549699c13a84ea1456d5b0e574ef855 mm, oom: introduce oom reaper

:::::: TO: Michal Hocko <mhocko@suse.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31959 bytes --]


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24 16:42 ` kbuild test robot
@ 2017-07-24 18:12   ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2017-07-24 18:12 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue 25-07-17 00:42:05, kbuild test robot wrote:
> Hi Michal,
> 
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.13-rc2 next-20170724]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-oom-allow-oom-reaper-to-race-with-exit_mmap/20170724-233159
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: x86_64-randconfig-x016-201730 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All error/warnings (new ones prefixed by >>):
> 
>    mm/oom_kill.c: In function '__oom_reap_task_mm':
> >> mm/oom_kill.c:523:9: error: 'ret' undeclared (first use in this function)
>      return ret;
>             ^~~

Fixed by http://lkml.kernel.org/r/20170724152703.GP25221@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24 16:11       ` Michal Hocko
@ 2017-07-25 14:17         ` Kirill A. Shutemov
  2017-07-25 14:26           ` Michal Hocko
  2017-07-25 14:26         ` Michal Hocko
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2017-07-25 14:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Mon, Jul 24, 2017 at 06:11:47PM +0200, Michal Hocko wrote:
> On Mon 24-07-17 17:51:42, Kirill A. Shutemov wrote:
> > On Mon, Jul 24, 2017 at 04:15:26PM +0200, Michal Hocko wrote:
> [...]
> > > What kind of scalability implications do you have in mind? There is
> > > basically zero contention on the mmap_sem that late in the exit path,
> > > so this should pretty much be the fast path of the down_write. I agree
> > > it is not zero cost, but the cost of the address space freeing should
> > > basically make it noise.
> > 
> > Even in the fast path case, it adds two atomic operations per process.
> > If the cache line is not exclusive to the core by the time of exit(2),
> > it can be noticeable.
> > 
> > ... but I guess it's not a very hot scenario.
> > 
> > I guess I'm just being too cautious here. :)
> 
> I definitely did not want to handwave your concern. I just think we can
> rule out the slow path and didn't think about the fast path overhead.
> 
> > > > Should we do a performance/scalability evaluation of the patch before
> > > > getting it applied?
> > > 
> > > What kind of test(s) would you be interested in?
> > 
> > Can we at least check that the number of /bin/true we can spawn per second
> > wouldn't be harmed by the patch? ;)
> 
> OK, so measuring a single /bin/true doesn't tell us anything, so I've done
> root@test1:~# cat a.sh 
> #!/bin/sh
> 
> NR=$1
> for i in $(seq $NR)
> do
>         /bin/true
> done
> 
> in my virtual machine (on an otherwise idle host) with 4 CPUs and 2GB of
> RAM
> 
> Unpatched kernel
> root@test1:~# /usr/bin/time -v ./a.sh 100000 
>         Command being timed: "./a.sh 100000"
>         User time (seconds): 53.57
>         System time (seconds): 26.12
>         Percent of CPU this job got: 100%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.46
> root@test1:~# /usr/bin/time -v ./a.sh 100000 
>         Command being timed: "./a.sh 100000"
>         User time (seconds): 53.90
>         System time (seconds): 26.23
>         Percent of CPU this job got: 100%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.77
> root@test1:~# /usr/bin/time -v ./a.sh 100000 
>         Command being timed: "./a.sh 100000"
>         User time (seconds): 54.02
>         System time (seconds): 26.18
>         Percent of CPU this job got: 100%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.92
> 
> patched kernel
> root@test1:~# /usr/bin/time -v ./a.sh 100000 
>         Command being timed: "./a.sh 100000"
>         User time (seconds): 53.81
>         System time (seconds): 26.55
>         Percent of CPU this job got: 100%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.99
> root@test1:~# /usr/bin/time -v ./a.sh 100000 
>         Command being timed: "./a.sh 100000"
>         User time (seconds): 53.78
>         System time (seconds): 26.15
>         Percent of CPU this job got: 100%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.67
> root@test1:~# /usr/bin/time -v ./a.sh 100000 
>         Command being timed: "./a.sh 100000"
>         User time (seconds): 54.08
>         System time (seconds): 26.87
>         Percent of CPU this job got: 100%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:20.52
> 
> the results vary quite a lot (have a look at the user time, which
> should have no reason to vary at all - maybe a virtual machine
> artifact?). I would say that we are still reasonably close to noise
> here. Considering that /bin/true would be close to the worst case, I
> think this looks reasonable. What do you think?
> 
> If you absolutely insist, I can make the lock conditional only for oom
> victims. That would still mean current->signal->oom_mm pointer fetches
> and 2 branches.


Below are numbers for the same test case, but from a bigger machine (48
threads, 64GiB of RAM).

v4.13-rc2:

 Performance counter stats for './a.sh 100000' (5 runs):

     159857.233790      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  3.21% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
         8,761,843      page-faults:u             #    0.055 M/sec                    ( +-  0.64% )
    38,725,763,026      cycles:u                  #    0.242 GHz                      ( +-  0.18% )
   272,691,643,016      stalled-cycles-frontend:u #  704.16% frontend cycles idle     ( +-  3.16% )
    22,221,416,575      instructions:u            #    0.57  insn per cycle
                                                  #   12.27  stalled cycles per insn  ( +-  0.00% )
     5,306,829,649      branches:u                #   33.197 M/sec                    ( +-  0.00% )
       240,783,599      branch-misses:u           #    4.54% of all branches          ( +-  0.15% )

     159.808721098 seconds time elapsed                                          ( +-  3.15% )

v4.13-rc2 + the patch:

 Performance counter stats for './a.sh 100000' (5 runs):

     167628.094556      task-clock:u (msec)       #    1.007 CPUs utilized            ( +-  1.63% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
         8,838,314      page-faults:u             #    0.053 M/sec                    ( +-  0.26% )
    38,862,240,137      cycles:u                  #    0.232 GHz                      ( +-  0.10% )
   282,105,057,553      stalled-cycles-frontend:u #  725.91% frontend cycles idle     ( +-  1.64% )
    22,219,273,623      instructions:u            #    0.57  insn per cycle
                                                  #   12.70  stalled cycles per insn  ( +-  0.00% )
     5,306,165,194      branches:u                #   31.654 M/sec                    ( +-  0.00% )
       240,473,075      branch-misses:u           #    4.53% of all branches          ( +-  0.07% )

     166.497005412 seconds time elapsed                                          ( +-  1.61% )

IMO, there is something to think about. A ~4% slowdown is not insignificant.
I expect the effect to be bigger on larger machines.

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 14:17         ` Kirill A. Shutemov
@ 2017-07-25 14:26           ` Michal Hocko
  2017-07-25 15:07             ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-25 14:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue 25-07-17 17:17:23, Kirill A. Shutemov wrote:
[...]
> Below are numbers for the same test case, but from a bigger machine (48
> threads, 64GiB of RAM).
> 
> v4.13-rc2:
> 
>  Performance counter stats for './a.sh 100000' (5 runs):
> 
>      159857.233790      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  3.21% )
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>          8,761,843      page-faults:u             #    0.055 M/sec                    ( +-  0.64% )
>     38,725,763,026      cycles:u                  #    0.242 GHz                      ( +-  0.18% )
>    272,691,643,016      stalled-cycles-frontend:u #  704.16% frontend cycles idle     ( +-  3.16% )
>     22,221,416,575      instructions:u            #    0.57  insn per cycle
>                                                   #   12.27  stalled cycles per insn  ( +-  0.00% )
>      5,306,829,649      branches:u                #   33.197 M/sec                    ( +-  0.00% )
>        240,783,599      branch-misses:u           #    4.54% of all branches          ( +-  0.15% )
> 
>      159.808721098 seconds time elapsed                                          ( +-  3.15% )
> 
> v4.13-rc2 + the patch:
> 
>  Performance counter stats for './a.sh 100000' (5 runs):
> 
>      167628.094556      task-clock:u (msec)       #    1.007 CPUs utilized            ( +-  1.63% )
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>          8,838,314      page-faults:u             #    0.053 M/sec                    ( +-  0.26% )
>     38,862,240,137      cycles:u                  #    0.232 GHz                      ( +-  0.10% )
>    282,105,057,553      stalled-cycles-frontend:u #  725.91% frontend cycles idle     ( +-  1.64% )
>     22,219,273,623      instructions:u            #    0.57  insn per cycle
>                                                   #   12.70  stalled cycles per insn  ( +-  0.00% )
>      5,306,165,194      branches:u                #   31.654 M/sec                    ( +-  0.00% )
>        240,473,075      branch-misses:u           #    4.53% of all branches          ( +-  0.07% )
> 
>      166.497005412 seconds time elapsed                                          ( +-  1.61% )
> 
> IMO, there is something to think about. A ~4% slowdown is not insignificant.
> I expect the effect to be bigger on larger machines.

Thanks for retesting, Kirill. Are those numbers stable across runs? E.g.
the run without the patch has ~3% variance while the one with the patch
has a smaller one. This sounds suspicious to me. There shouldn't be any
lock contention (except for the oom killer), so the lock shouldn't make
any difference wrt. variability.

Also, I was about to post a more targeted test. Could you try with it
as well, please?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24 16:11       ` Michal Hocko
  2017-07-25 14:17         ` Kirill A. Shutemov
@ 2017-07-25 14:26         ` Michal Hocko
  2017-07-25 15:17           ` Kirill A. Shutemov
  1 sibling, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-25 14:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 2000 bytes --]

On Mon 24-07-17 18:11:46, Michal Hocko wrote:
> On Mon 24-07-17 17:51:42, Kirill A. Shutemov wrote:
> > On Mon, Jul 24, 2017 at 04:15:26PM +0200, Michal Hocko wrote:
> [...]
> > > What kind of scalability implications do you have in mind? There is
> > > basically zero contention on the mmap_sem that late in the exit path,
> > > so this should pretty much be the fast path of the down_write. I agree
> > > it is not zero cost, but the cost of the address space freeing should
> > > basically make it noise.
> > 
> > Even in the fast path case, it adds two atomic operations per process.
> > If the cache line is not exclusive to the core by the time of exit(2),
> > it can be noticeable.
> > 
> > ... but I guess it's not a very hot scenario.
> > 
> > I guess I'm just being too cautious here. :)
> 
> I definitely did not want to handwave your concern. I just think we can
> rule out the slow path and didn't think about the fast path overhead.
> 
> > > > Should we do a performance/scalability evaluation of the patch before
> > > > getting it applied?
> > > 
> > > What kind of test(s) would you be interested in?
> > 
> > Can we at least check that the number of /bin/true we can spawn per second
> > wouldn't be harmed by the patch? ;)
> 
> OK, so measuring a single /bin/true doesn't tell us anything, so I've done
> root@test1:~# cat a.sh 
> #!/bin/sh
> 
> NR=$1
> for i in $(seq $NR)
> do
>         /bin/true
> done

I wanted to reduce potential shell side effects, so I've come up with a
simple program which forks and saves the timestamp right before the
child exits and right after waitpid, and then measured it 100k times.
Sure, this still measures the waitpid overhead and the signal delivery,
but those should be more or less constant on an idle system, right? See
attached.

before the patch
min: 306300.00 max: 6731916.00 avg: 437962.07 std: 92898.30 nr: 100000

after
min: 303196.00 max: 5728080.00 avg: 436081.87 std: 96165.98 nr: 100000

The results are well within noise, as I would expect.
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: exit_time.c --]
[-- Type: text/x-csrc, Size: 985 bytes --]

#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>

#define NR_FORKS 100000
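
/*
 * Fork NR_FORKS children; each child stamps a shared page with the TSC
 * right before exiting, and the parent reads the TSC again right after
 * waitpid() and prints the child-exit-to-reap latency in cycles.
 */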

static inline uint64_t get_cycles(void)
{
	uint64_t t;
	volatile int dont_remove __attribute__((unused));
	unsigned tmp;

	__asm volatile ("cpuid" : "=a"(tmp), "=b"(tmp), "=c"(tmp), "=d"(tmp): "a" (0));

	dont_remove = tmp; 
	__asm volatile ("rdtsc" : "=A"(t));
	return t;
}

int main(int argc, char **argv)
{
	void *addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED, -1, 0);
	int i = NR_FORKS, j = 1;

	assert(addr != MAP_FAILED);

	while (i-- > 0) {
		pid_t child = fork();
		uint64_t before, after;

		assert(child != -1);
		if (!child) {
			*(uint64_t *)addr = get_cycles();
			return 0;
		}
		assert(child == waitpid(child, NULL, 0));
		before = *(uint64_t *)addr;
		after = get_cycles();

		printf("%u\n", (unsigned)(after - before));
		fflush(stdout);
	}

	return 0;
}


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 14:26           ` Michal Hocko
@ 2017-07-25 15:07             ` Kirill A. Shutemov
  2017-07-25 15:15               ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2017-07-25 15:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue, Jul 25, 2017 at 04:26:17PM +0200, Michal Hocko wrote:
> On Tue 25-07-17 17:17:23, Kirill A. Shutemov wrote:
> [...]
> > Below are numbers for the same test case, but from a bigger machine (48
> > threads, 64GiB of RAM).
> > 
> > v4.13-rc2:
> > 
> >  Performance counter stats for './a.sh 100000' (5 runs):
> > 
> >      159857.233790      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  3.21% )
> >                  0      context-switches:u        #    0.000 K/sec
> >                  0      cpu-migrations:u          #    0.000 K/sec
> >          8,761,843      page-faults:u             #    0.055 M/sec                    ( +-  0.64% )
> >     38,725,763,026      cycles:u                  #    0.242 GHz                      ( +-  0.18% )
> >    272,691,643,016      stalled-cycles-frontend:u #  704.16% frontend cycles idle     ( +-  3.16% )
> >     22,221,416,575      instructions:u            #    0.57  insn per cycle
> >                                                   #   12.27  stalled cycles per insn  ( +-  0.00% )
> >      5,306,829,649      branches:u                #   33.197 M/sec                    ( +-  0.00% )
> >        240,783,599      branch-misses:u           #    4.54% of all branches          ( +-  0.15% )
> > 
> >      159.808721098 seconds time elapsed                                          ( +-  3.15% )
> > 
> > v4.13-rc2 + the patch:
> > 
> >  Performance counter stats for './a.sh 100000' (5 runs):
> > 
> >      167628.094556      task-clock:u (msec)       #    1.007 CPUs utilized            ( +-  1.63% )
> >                  0      context-switches:u        #    0.000 K/sec
> >                  0      cpu-migrations:u          #    0.000 K/sec
> >          8,838,314      page-faults:u             #    0.053 M/sec                    ( +-  0.26% )
> >     38,862,240,137      cycles:u                  #    0.232 GHz                      ( +-  0.10% )
> >    282,105,057,553      stalled-cycles-frontend:u #  725.91% frontend cycles idle     ( +-  1.64% )
> >     22,219,273,623      instructions:u            #    0.57  insn per cycle
> >                                                   #   12.70  stalled cycles per insn  ( +-  0.00% )
> >      5,306,165,194      branches:u                #   31.654 M/sec                    ( +-  0.00% )
> >        240,473,075      branch-misses:u           #    4.53% of all branches          ( +-  0.07% )
> > 
> >      166.497005412 seconds time elapsed                                          ( +-  1.61% )
> > 
> > IMO, there is something to think about. A ~4% slowdown is not insignificant.
> > I expect the effect to be bigger on larger machines.
> 
> Thanks for retesting, Kirill. Are those numbers stable across runs? E.g.
> the run without the patch has ~3% variance while the one with the patch
> has a smaller one. This sounds suspicious to me. There shouldn't be any
> lock contention (except for the oom killer), so the lock shouldn't make
> any difference wrt. variability.

There's run-to-run variability. I'll post new numbers for your new test.

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 15:07             ` Kirill A. Shutemov
@ 2017-07-25 15:15               ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2017-07-25 15:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue 25-07-17 18:07:19, Kirill A. Shutemov wrote:
> On Tue, Jul 25, 2017 at 04:26:17PM +0200, Michal Hocko wrote:
[...]
> > Thanks for retesting, Kirill. Are those numbers stable across runs? E.g.
> > the run without the patch has ~3% variance while the one with the patch
> > has a smaller one. This sounds suspicious to me. There shouldn't be any
> > lock contention (except for the oom killer), so the lock shouldn't make
> > any difference wrt. variability.
> 
> There's run-to-run variability. I'll post new numbers for your new test.

That's what I've seen, and the variance was quite large. I suspected
the shell, but if you look at the more dedicated test, the std over avg
is still quite large (92898/437962, i.e. roughly 21%, in the run above).
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 14:26         ` Michal Hocko
@ 2017-07-25 15:17           ` Kirill A. Shutemov
  2017-07-25 15:23             ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2017-07-25 15:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue, Jul 25, 2017 at 04:26:26PM +0200, Michal Hocko wrote:
> On Mon 24-07-17 18:11:46, Michal Hocko wrote:
> > On Mon 24-07-17 17:51:42, Kirill A. Shutemov wrote:
> > > On Mon, Jul 24, 2017 at 04:15:26PM +0200, Michal Hocko wrote:
> > [...]
> > > > What kind of scalability implications do you have in mind? There is
> > > > basically zero contention on the mmap_sem that late in the exit path,
> > > > so this should pretty much be the fast path of the down_write. I agree
> > > > it is not zero cost, but the cost of the address space freeing should
> > > > basically make it noise.
> > > 
> > > Even in the fast path case, it adds two atomic operations per process.
> > > If the cache line is not exclusive to the core by the time of exit(2),
> > > it can be noticeable.
> > > 
> > > ... but I guess it's not a very hot scenario.
> > > 
> > > I guess I'm just being too cautious here. :)
> > 
> > I definitely did not want to handwave your concern. I just think we can
> > rule out the slow path and didn't think about the fast path overhead.
> > 
> > > > > Should we do a performance/scalability evaluation of the patch before
> > > > > getting it applied?
> > > > 
> > > > What kind of test(s) would you be interested in?
> > > 
> > > Can we at least check that the number of /bin/true we can spawn per second
> > > wouldn't be harmed by the patch? ;)
> > 
> > OK, so measuring a single /bin/true doesn't tell us anything, so I've done
> > root@test1:~# cat a.sh 
> > #!/bin/sh
> > 
> > NR=$1
> > for i in $(seq $NR)
> > do
> >         /bin/true
> > done
> 
> I wanted to reduce potential shell side effects, so I've come up with a
> simple program which forks and saves the timestamp right before the
> child exits and right after waitpid, and then measured it 100k times.
> Sure, this still measures the waitpid overhead and the signal delivery,
> but those should be more or less constant on an idle system, right? See
> attached.
> 
> before the patch
> min: 306300.00 max: 6731916.00 avg: 437962.07 std: 92898.30 nr: 100000
> 
> after
> min: 303196.00 max: 5728080.00 avg: 436081.87 std: 96165.98 nr: 100000
> 
> The results are well within noise, as I would expect.

I've slightly modified your test case: replaced cpuid + rdtsc with
rdtscp. The cpuid overhead is measurable in such a tight loop.
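
A minimal sketch of that change (assuming <x86intrin.h> is available;
rdtscp waits for prior instructions to retire, so the separate
serializing cpuid can go away, and it also yields the full 64-bit TSC):

	#include <x86intrin.h>

	static inline uint64_t get_cycles(void)
	{
		unsigned int aux;	/* receives IA32_TSC_AUX; unused here */
		return __rdtscp(&aux);
	}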

3 runs before the patch:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 177200  205000  212900  217800  223700 2377000
 172400  201700  209700  214300  220600 1343000
 175700  203800  212300  217100  223000 1061000

3 runs after the patch:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 175900  204800  213000  216400  223600 1989000
 180300  210900  219600  223600  230200 3184000
 182100  212500  222000  226200  232700 1473000

The difference is still measurable. Around 3%.

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 15:17           ` Kirill A. Shutemov
@ 2017-07-25 15:23             ` Michal Hocko
  2017-07-25 15:31               ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-25 15:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue 25-07-17 18:17:54, Kirill A. Shutemov wrote:
> > before the patch
> > min: 306300.00 max: 6731916.00 avg: 437962.07 std: 92898.30 nr: 100000
> > 
> > after
> > min: 303196.00 max: 5728080.00 avg: 436081.87 std: 96165.98 nr: 100000
> > 
> > The results are well withing noise as I would expect.
> 
> I've slightly modified your test case: replaced cpuid + rdtsc with
> rdtscp. The cpuid overhead is measurable in such a tight loop.
> 
> 3 runs before the patch:
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  177200  205000  212900  217800  223700 2377000
>  172400  201700  209700  214300  220600 1343000
>  175700  203800  212300  217100  223000 1061000
> 
> 3 runs after the patch:
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  175900  204800  213000  216400  223600 1989000
>  180300  210900  219600  223600  230200 3184000
>  182100  212500  222000  226200  232700 1473000
> 
> The difference is still measurable. Around 3%.

What is the stdev?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-24  7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
                   ` (2 preceding siblings ...)
  2017-07-24 16:42 ` kbuild test robot
@ 2017-07-25 15:26 ` Andrea Arcangeli
  2017-07-25 15:45   ` Michal Hocko
  3 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2017-07-25 15:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, linux-mm, LKML, Michal Hocko

On Mon, Jul 24, 2017 at 09:23:32AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> David has noticed that the oom killer might kill additional tasks while
> the exiting oom victim hasn't terminated yet, because the oom_reaper marks
> the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
> down to 0. The race is as follows:
> 
> oom_reap_task				do_exit
> 					  exit_mm
>   __oom_reap_task_mm
> 					    mmput
> 					      __mmput
>     mmget_not_zero # fails
>     						exit_mmap # frees memory
>   set_bit(MMF_OOM_SKIP)
> 
> The victim is still visible to the OOM killer until it is unhashed.

I think this is a very minor problem: in the worst case you get a
false positive oom kill, and it requires a race condition for it to
happen. I wouldn't add mmap_sem to exit_mmap just for this, considering
that the mmget_not_zero is already enough to leave exit_mmap alone.

Could you first clarify these points? Then I'll understand better what
the above is about:

1) if exit_mmap runs for a long time with terabytes of RAM with
   mmap_sem held for writing like your patch does, wouldn't
   oom_reap_task_mm then fail the same way after a few tries of
   down_read_trylock, despite your patch being applied? Isn't that
   simply moving the failure that leads to set_bit(MMF_OOM_SKIP) from
   mmget_not_zero to down_read_trylock?

2) why isn't __oom_reap_task_mm returning different retvals in case
   mmget_not_zero fails? What is the point of the schedule_timeout
   and retrying MAX_OOM_REAP_RETRIES times if mmget_not_zero caused it
   to bail out, as it can't do anything about such a task anymore? Why
   are we scheduling those retries if mm_users is 0?

3) if exit_mmap is freeing lots of memory already, why should there be
   another OOM immediately? I thought the oom reaper was only needed
   when the task in the right column couldn't reach the final mmput to
   set mm_users to 0. Why exactly is it a problem that MMF_OOM_SKIP
   gets set on the mm, if exit_mmap is already guaranteed to be
   running? Why isn't the oom reaper happy to just stop in such a case
   and wait for it to complete? exit_mmap doesn't even take the
   mmap_sem and it's running in R state; how would it block in a way
   that requires the OOM reaper to free memory from another process to
   complete?

4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
   mmap_sem and then the arch code will release the mmap_sem despite
   it was already released by handle_mm_fault? Anonymous memory faults
   don't commonly return VM_FAULT_RETRY but a userfault
   can. Shouldn't there be a check that prevents overwriting if
   VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)

	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
		ret = VM_FAULT_SIGBUS;
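
(Presumably something along these lines -- an untested sketch of the
suggested guard, not an actual patch from this thread:)

	if (unlikely((current->flags & PF_KTHREAD) &&
		     !(ret & (VM_FAULT_ERROR | VM_FAULT_RETRY)) &&
		     test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
		ret = VM_FAULT_SIGBUS;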

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 15:23             ` Michal Hocko
@ 2017-07-25 15:31               ` Kirill A. Shutemov
  2017-07-25 16:04                 ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2017-07-25 15:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue, Jul 25, 2017 at 05:23:00PM +0200, Michal Hocko wrote:
> what is stdev?

Updated tables:

3 runs before the patch:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  Stdev
 177200  205000  212900  217800  223700 2377000  32868
 172400  201700  209700  214300  220600 1343000  31191
 175700  203800  212300  217100  223000 1061000  31195

3 runs after the patch:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  Stdev
 175900  204800  213000  216400  223600 1989000  27210
 180300  210900  219600  223600  230200 3184000  32609
 182100  212500  222000  226200  232700 1473000  32138

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 15:26 ` Andrea Arcangeli
@ 2017-07-25 15:45   ` Michal Hocko
  2017-07-25 18:26     ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-25 15:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, linux-mm, LKML

On Tue 25-07-17 17:26:39, Andrea Arcangeli wrote:
> On Mon, Jul 24, 2017 at 09:23:32AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > David has noticed that the oom killer might kill additional tasks while
> > the exiting oom victim hasn't terminated yet because the oom_reaper marks
> > the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
> > to 0. The race is as follows
> > 
> > oom_reap_task				do_exit
> > 					  exit_mm
> >   __oom_reap_task_mm
> > 					    mmput
> > 					      __mmput
> >     mmget_not_zero # fails
> >     						exit_mmap # frees memory
> >   set_bit(MMF_OOM_SKIP)
> > 
> > The victim is still visible to the OOM killer until it is unhashed.
> 
> I think this is a very minor problem: in the worst case you get a
> false positive oom kill, and it requires a race condition for it to
> happen. I wouldn't take the mmap_sem in exit_mmap just for this,
> considering the mmget_not_zero check is already enough to leave
> exit_mmap alone.

That problem is real though as reported by David.

> Could you first clarify these points then I'll understand better what
> the above is about:
> 
> 1) if exit_mmap runs for a long time with terabytes of RAM with the
>    mmap_sem held for writing like your patch does, wouldn't
>    oom_reap_task_mm then fail the same way after a few tries on
>    down_read_trylock, despite your patch being applied? Isn't that
>    simply moving the failure that leads to set_bit(MMF_OOM_SKIP) from
>    mmget_not_zero to down_read_trylock?

No, it isn't, because the exclusive lock in exit_mmap is taken _after_
we have unmapped the address space. unmap_vmas will happily race with
the oom reaper.
 
> 2) why isn't __oom_reap_task_mm returning different retvals in case
>    mmget_not_zero fails? What is the point of the schedule_timeout
>    and retrying MAX_OOM_REAP_RETRIES times if a failing mmget_not_zero
>    caused it to bail out, as it can't do anything about such a task
>    anymore? Why are we scheduling those RETRIES times if mm_users is 0?

We are not. __oom_reap_task_mm will return true and bail out if
mm_users is 0.

> 3) if exit_mmap is freeing lots of memory already, why should there be
>    another OOM immediately?

Because the memory can be freed from a different oom domain (e.g. a
different NUMA node).

>    I thought the oom reaper was only needed when
>    the task in the right column couldn't reach the final mmput to set
>    mm_users to 0. Why exactly is it a problem that MMF_OOM_SKIP gets set
>    on the mm, if exit_mmap is already guaranteed to be running?

MMF_OOM_SKIP will hide this task from the OOM killer and so we will
select another victim if we are still under oom. We _want_ to postpone
setting MMF_OOM_SKIP until we know that the oom victim is no longer
interesting and we can go on to select another one.
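
(For context, a sketch of the check in oom_evaluate_task() this refers
to, as in the mm/oom_kill.c of that era; details may differ:)

	/* an existing victim normally aborts the victim search ... */
	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		/* ... unless its mm has been marked MMF_OOM_SKIP */
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}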

>    Why
>    isn't the oom reaper happy to just stop in such a case and wait for it
>    to complete?

Because there is no _guarantee_ that the final __mmput will release the
memory in finite time. And we cannot guarantee that longterm.

>    exit_mmap doesn't even take the mmap_sem and it's running
>    in R state, how would it block in a way that requires the OOM
>    reaper to free memory from another process to complete?

it is not only about exit_mmap. __mmput calls into exit_aio and that can
wait for completion and there is no way to guarantee this will finish in
finite time.
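
(For context, a sketch of the final __mmput teardown order from the
kernel/fork.c of that era -- exit_aio runs well before exit_mmap; the
exact contents may differ:)

	static inline void __mmput(struct mm_struct *mm)
	{
		VM_BUG_ON(atomic_read(&mm->mm_users));

		uprobe_clear_state(mm);
		exit_aio(mm);		/* may block on in-flight aio */
		ksm_exit(mm);
		khugepaged_exit(mm);	/* must run before exit_mmap */
		exit_mmap(mm);
		...
	}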

> 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
>    mmap_sem and then the arch code will release the mmap_sem despite
>    it was already released by handle_mm_fault? Anonymous memory faults
>    don't commonly return VM_FAULT_RETRY but a userfault
>    can. Shouldn't there be a check that prevents overwriting if
>    VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> 
> 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> 		ret = VM_FAULT_SIGBUS;

I am not sure I understand what you mean and how this is related to the
patch?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 15:31               ` Kirill A. Shutemov
@ 2017-07-25 16:04                 ` Michal Hocko
  2017-07-25 19:19                   ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-25 16:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, Andrea Arcangeli, linux-mm, LKML

On Tue 25-07-17 18:31:10, Kirill A. Shutemov wrote:
> On Tue, Jul 25, 2017 at 05:23:00PM +0200, Michal Hocko wrote:
> > what is stdev?
> 
> Updated tables:
> 
> 3 runs before the patch:
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  Stdev
>  177200  205000  212900  217800  223700 2377000  32868
>  172400  201700  209700  214300  220600 1343000  31191
>  175700  203800  212300  217100  223000 1061000  31195
> 
> 3 runs after the patch:
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.  Stdev
>  175900  204800  213000  216400  223600 1989000  27210
>  180300  210900  219600  223600  230200 3184000  32609
>  182100  212500  222000  226200  232700 1473000  32138

A high stdev/mean ratio of ~15% matches my measurements (mine were even
higher, ~20%) and that would suggest that the 3% average difference is
still something within the noise.

Anyway, I do not really need to take the lock unless the task is the
oom victim. Could you try whether those numbers improve if the lock is
conditional?

Thanks!
---
diff --git a/mm/mmap.c b/mm/mmap.c
index 0eeb658caa30..ca8a274485f8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -44,6 +44,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
+#include <linux/oom.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2997,7 +2998,8 @@ void exit_mmap(struct mm_struct *mm)
 	 * oom reaper might race with exit_mmap so make sure we won't free
 	 * page tables or unmap VMAs under its feet
 	 */
-	down_write(&mm->mmap_sem);
+	if (tsk_is_oom_victim(current))
+		down_write(&mm->mmap_sem);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
 	}
 	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
-	up_write(&mm->mmap_sem);
+	if (tsk_is_oom_victim(current))
+		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 15:45   ` Michal Hocko
@ 2017-07-25 18:26     ` Andrea Arcangeli
  2017-07-26  5:45       ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2017-07-25 18:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, linux-mm, LKML

On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> That problem is real though as reported by David.

I'm not against fixing it, I just think it's not a major concern, and
the solution doesn't seem optimal as measured by Kirill.

I'm just skeptical it's the best way to solve that tiny race; 99.9% of
the time such a down_write is unnecessary.

> it is not only about exit_mmap. __mmput calls into exit_aio and that can
> wait for completion and there is no way to guarantee this will finish in
> finite time.

exit_aio blocking is actually the only good point for wanting this
concurrency where exit_mmap->unmap_vmas and
oom_reap_task->unmap_page_range have to run concurrently on the same
mm.

exit_mmap would have no issue: if there was enough CPU time in the
process lifetime to allocate the memory, surely the memory will also be
freed in a finite amount of time by exit_mmap.

In fact you mentioned multiple OOMs in the NUMA case; exit_mmap may not
solve that, so depending on the runtime it may have been better not to
wait for all memory of the process to be freed before moving to the
next task, but only a couple of seconds before the OOM reaper moves to
a new candidate. Again, this is only a tradeoff between solving the OOM
faster vs the risk of false positive OOMs.

If it wasn't because of exit_aio (which may have to wait I/O
completion), changing the OOM reaper to return "false" if
mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
have been enough (and depending on the runtime it may have solved OOM
faster in NUMA) and there would be absolutely no need to run OOM
reaper and exit_mmap concurrently on the same mm. However there's such
exit_aio..

Raw I/O mempools never require memory allocations, although aio if it
involves a filesystem to complete may run into filesystem or buffering
locks which are known to loop forever or depend on other tasks stuck
in kernel allocations, so I didn't go down that chain too long.

So the simplest is to use a similar trick to what ksm_exit uses. This
is untested, just to show the idea; it may require further adjustment,
as the bit isn't used only for the test_and_set_bit locking, but I
didn't see many issues with the other set_bit/test_bit users.
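
(The ksm_exit trick referred to is the empty lock/unlock handshake --
a condensed sketch of the mm/ksm.c code, trimmed to the relevant tail:)

	void ksm_exit(struct mm_struct *mm)
	{
		...
		if (easy_to_free) {
			free_mm_slot(mm_slot);
			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
			mmdrop(mm);
		} else if (mm_slot) {
			/* ksmd may still be working on this mm: wait for it */
			down_write(&mm->mmap_sem);
			up_write(&mm->mmap_sem);
		}
	}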

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 16:04                 ` Michal Hocko
@ 2017-07-25 19:19                   ` Andrea Arcangeli
  2017-07-26  5:45                     ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2017-07-25 19:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> -	down_write(&mm->mmap_sem);
> +	if (tsk_is_oom_victim(current))
> +		down_write(&mm->mmap_sem);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
>  	}
>  	mm->mmap = NULL;
>  	vm_unacct_memory(nr_accounted);
> -	up_write(&mm->mmap_sem);
> +	if (tsk_is_oom_victim(current))
> +		up_write(&mm->mmap_sem);

How is this possibly safe? mark_oom_victim can run while exit_mmap is
running. Even if you cache the first read on the local stack, failing
to notice you marked it could lead to a use after free. Or at least
there's no comment on which lock should prevent the use after free
with the above.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 18:26     ` Andrea Arcangeli
@ 2017-07-26  5:45       ` Michal Hocko
  2017-07-26 16:39         ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-26  5:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, linux-mm, LKML

On Tue 25-07-17 20:26:19, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> > That problem is real though as reported by David.
> 
> I'm not against fixing it, I just think it's not a major concern, and
> the solution doesn't seem optimal as measured by Kirill.
> 
> I'm just skeptical it's the best way to solve that tiny race; 99.9% of
> the time such a down_write is unnecessary.
> 
> > it is not only about exit_mmap. __mmput calls into exit_aio and that can
> > wait for completion and there is no way to guarantee this will finish in
> > finite time.
> 
> exit_aio blocking is actually the only good point for wanting this
> concurrency where exit_mmap->unmap_vmas and
> oom_reap_task->unmap_page_range have to run concurrently on the same
> mm.

Yes, exit_aio is the only blocking call I know of currently. But I would
like this to be as robust as possible and so I do not want to rely on
the current implementation. This can change in future and I can
guarantee that nobody will think about the oom path when adding
something to the final __mmput path.

> exit_mmap would have no issue: if there was enough CPU time in the
> process lifetime to allocate the memory, surely the memory will also be
> freed in a finite amount of time by exit_mmap.

I am not sure I understand. Say that any call prior to unmap_vmas blocks
on a lock which is held by another call path which cannot proceed with
the allocation...
 
> In fact you mentioned multiple OOMs in the NUMA case; exit_mmap may not
> solve that, so depending on the runtime it may have been better not to
> wait for all memory of the process to be freed before moving to the
> next task, but only a couple of seconds before the OOM reaper moves to
> a new candidate. Again, this is only a tradeoff between solving the OOM
> faster vs the risk of false positive OOMs.

I really do not want to rely on any timing. This is just too fragile. Once
we have killed a task then we shouldn't pick another victim until it
passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
false positives while we have already disrupted the workload.
 
> If it wasn't because of exit_aio (which may have to wait I/O
> completion), changing the OOM reaper to return "false" if
> mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
> have been enough (and depending on the runtime it may have solved OOM
> faster in NUMA) and there would be absolutely no need to run OOM
> reaper and exit_mmap concurrently on the same mm. However there's such
> exit_aio..
> 
> Raw I/O mempools never require memory allocations, although aio if it
> involves a filesystem to complete may run into filesystem or buffering
> locks which are known to loop forever or depend on other tasks stuck
> in kernel allocations, so I didn't go down that chain too long.

Exactly. We simply cannot assume anything here because verifying this
is basically impossible.
 
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..615133762b99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		/* wait the OOM reaper to stop working on this mm */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..2a7000995784 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> +	bool mmgot = true;
>  
>  	/*
>  	 * We have to make sure to not race with the victim exit path
> @@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 * and delayed __mmput doesn't matter that much
>  	 */
>  	if (!mmget_not_zero(mm)) {
> -		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		goto unlock_oom;
> +		/*
> +		 * MMF_OOM_SKIP is set by exit_mmap when the OOM
> +		 * reaper can't work on the mm anymore.
> +		 */
> +		if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +			up_read(&mm->mmap_sem);
> +			goto unlock_oom;
> +		}
> +		mmgot = false;
>  	}

This will work more or less the same as what we have currently.

[victim]		[oom reaper]				[oom killer]
do_exit			__oom_reap_task_mm
  mmput
    __mmput
			  mmget_not_zero
			    test_and_set_bit(MMF_OOM_SKIP)
			    					oom_evaluate_task
								   # select next victim 
			  # reap the mm
      unmap_vmas

so we can select a next victim while the current one is still not
completely torn down.

> > > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > >    mmap_sem and then the arch code will release the mmap_sem despite
> > >    it was already released by handle_mm_fault? Anonymous memory faults
> > >    don't commonly return VM_FAULT_RETRY but a userfault
> > >    can. Shouldn't there be a check that prevents overwriting if
> > >    VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > > 
> > > 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > 		ret = VM_FAULT_SIGBUS;
> > 
> > I am not sure I understand what you mean and how this is related to the
> > patch?
> 
> It's not related to the patch but it involves the OOM reaper as it
> only happens when MMF_UNSTABLE is set which is set only by the OOM
> reaper. I was simply reading the OOM reaper code and following up what
> MMF_UNSTABLE does and it ringed a bell.

I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
reaped memory") will clarify this code. If not please start a new thread
so that we do not conflate different things together.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-25 19:19                   ` Andrea Arcangeli
@ 2017-07-26  5:45                     ` Michal Hocko
  2017-07-26 16:29                       ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-26  5:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > -	down_write(&mm->mmap_sem);
> > +	if (tsk_is_oom_victim(current))
> > +		down_write(&mm->mmap_sem);
> >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  
> > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> >  	}
> >  	mm->mmap = NULL;
> >  	vm_unacct_memory(nr_accounted);
> > -	up_write(&mm->mmap_sem);
> > +	if (tsk_is_oom_victim(current))
> > +		up_write(&mm->mmap_sem);
> 
> How is this possibly safe? mark_oom_victim can run while exit_mmap is
> running.

I believe it cannot. We always call mark_oom_victim (on !current) with
task_lock held and check task->mm != NULL and we call do_exit->mmput after
mm is set to NULL under the same lock.
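
(A simplified sketch of the two sides of that handshake, assuming the
oom_kill_process/exit_mm code of that era:)

	/* oom killer side (oom_kill_process) */
	p = find_lock_task_mm(victim);	/* task_lock held, p->mm != NULL */
	...
	mark_oom_victim(victim);	/* pins the mm in signal->oom_mm */
	task_unlock(victim);

	/* exiting task side (exit_mm) */
	task_lock(current);
	current->mm = NULL;
	task_unlock(current);
	mmput(mm);			/* may end up in __mmput -> exit_mmap */
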
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-26  5:45                     ` Michal Hocko
@ 2017-07-26 16:29                       ` Andrea Arcangeli
  2017-07-26 16:43                         ` Andrea Arcangeli
                                           ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2017-07-26 16:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Wed, Jul 26, 2017 at 07:45:57AM +0200, Michal Hocko wrote:
> On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> > On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > > -	down_write(&mm->mmap_sem);
> > > +	if (tsk_is_oom_victim(current))
> > > +		down_write(&mm->mmap_sem);
> > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >  	tlb_finish_mmu(&tlb, 0, -1);
> > >  
> > > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> > >  	}
> > >  	mm->mmap = NULL;
> > >  	vm_unacct_memory(nr_accounted);
> > > -	up_write(&mm->mmap_sem);
> > > +	if (tsk_is_oom_victim(current))
> > > +		up_write(&mm->mmap_sem);
> > 
> > How is this possibly safe? mark_oom_victim can run while exit_mmap is
> > running.
> 
> I believe it cannot. We always call mark_oom_victim (on !current) with
> task_lock held and check task->mm != NULL and we call do_exit->mmput after
> mm is set to NULL under the same lock.

Holding the mmap_sem for writing, and setting mm->mmap to NULL to
filter which tasks have already released the mmap_sem for writing after
free_pgtables, still looks unnecessary to solve this.

Using MMF_OOM_SKIP as the flag had the side effect of oom_badness()
skipping it, but we can use the same tsk_is_oom_victim instead and rely
on the locking in mark_oom_victim you pointed out above instead of the
test_and_set_bit of my patch, because current->mm is already NULL at
that point.

A remaining question, in light of the above: because current->mm is
NULL by the time mmput is called, how can you start the oom_reap_task
on a process with current->mm NULL that called the last mmput and is
blocked in exit_aio? It looks like no false positive can get fixed
until this is solved first.

Isn't this enough? If it is, it avoids other modifications to the
exit_mmap runtime that look unnecessary: mm->mmap = NULL is replaced by
MMF_OOM_SKIP, which has to be set anyway by __mmput later, and one
unnecessary branch to call the up_write is avoided.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-26  5:45       ` Michal Hocko
@ 2017-07-26 16:39         ` Andrea Arcangeli
  2017-07-27  6:32           ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2017-07-26 16:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, linux-mm, LKML

On Wed, Jul 26, 2017 at 07:45:33AM +0200, Michal Hocko wrote:
> Yes, exit_aio is the only blocking call I know of currently. But I would
> like this to be as robust as possible and so I do not want to rely on
> the current implementation. This can change in future and I can
> guarantee that nobody will think about the oom path when adding
> something to the final __mmput path.

I think ksm_exit may block too, waiting for allocations; the general
idea is that those calls before exit_mmap can cause a problem, yes.

> > exit_mmap would have no issue, if there was enough time in the
> > lifetime CPU to allocate the memory, sure the memory will also be
> > freed in finite amount of time by exit_mmap.
> 
> I am not sure I understand. Say that any call prior to unmap_vmas blocks
> on a lock which is held by another call path which cannot proceed with
> the allocation...

What I meant was, if there was no prior call to exit_mmap->unmap_vmas.

> I really do not want to rely on any timing. This just too fragile. Once
> we have killed a task then we shouldn't pick another victim until it
> passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
> false positives while we have already disrupted the workload.

On smaller systems lack of parallelism in OOM killing surely isn't a
problem.

> This will work more or less the same as what we have currently.
> 
> [victim]		[oom reaper]				[oom killer]
> do_exit			__oom_reap_task_mm
>   mmput
>     __mmput
> 			  mmget_not_zero
> 			    test_and_set_bit(MMF_OOM_SKIP)
> 			    					oom_evaluate_task
> 								   # select next victim 
> 			  # reap the mm
>       unmap_vmas
>
> so we can select a next victim while the current one is still not
> completely torn down.

How does oom_evaluate_task possibly run at the same time as the
test_and_set_bit in __oom_reap_task_mm, considering both are running
under the oom_lock? It's hard to see how what you describe above could
materialize, as the second and third columns cannot run in parallel
because of the oom_lock.

I don't think there was any issue, but then you pointed out the
locking on signal->oom_mm that is protected by the task_lock vs the
current->mm NULL check, so in my patch I can replace the
test_and_set_bit with set_bit on one side and the task_lock-protected
oom_mm locking on the other side. This way I can put back a set_bit
in the __mmput fast path (instead of test_and_set_bit) and it's even
more efficient. With such a change, I'll also stop depending on the
oom_lock to prevent the second and third columns from running in
parallel.

I still didn't remove the oom_lock outright; that seems an orthogonal
change unrelated to this issue, but now you could remove it as far as
the above is concerned.

> I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
> reaped memory") will clarify this code. If not please start a new thread
> so that we do not conflate different things together.

I'll look into that, thanks.
Andrea

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-26 16:29                       ` Andrea Arcangeli
@ 2017-07-26 16:43                         ` Andrea Arcangeli
  2017-07-27  6:50                         ` Michal Hocko
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2017-07-26 16:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Wed, Jul 26, 2017 at 06:29:12PM +0200, Andrea Arcangeli wrote:
> From 3d9001490ee1a71f39c7bfaf19e96821f9d3ff16 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Tue, 25 Jul 2017 20:02:27 +0200
> Subject: [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run
>  concurrently

This needs an incremental one liner...

diff --git a/mm/mmap.c b/mm/mmap.c
index bdab595ce25c..fd16996ee0a8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -44,6 +44,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
+#include <linux/oom.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-26 16:39         ` Andrea Arcangeli
@ 2017-07-27  6:32           ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2017-07-27  6:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, David Rientjes, Tetsuo Handa, Oleg Nesterov,
	Hugh Dickins, linux-mm, LKML

On Wed 26-07-17 18:39:28, Andrea Arcangeli wrote:
> On Wed, Jul 26, 2017 at 07:45:33AM +0200, Michal Hocko wrote:
> > Yes, exit_aio is the only blocking call I know of currently. But I would
> > like this to be as robust as possible and so I do not want to rely on
> > the current implementation. This can change in future and I can
> > guarantee that nobody will think about the oom path when adding
> > something to the final __mmput path.
> 
> I think ksm_exit may block too, waiting for allocations; the general
> idea is that those calls before exit_mmap can cause a problem, yes.

I thought that ksm used __GFP_NORETRY but haven't checked too deeply.
Anyway I guess we agree that enabling oom_reaper to race with the final
__mmput is desirable?

[...]
> > This will work more or less the same as what we have currently.
> > 
> > [victim]		[oom reaper]				[oom killer]
> > do_exit			__oom_reap_task_mm
> >   mmput
> >     __mmput
> > 			  mmget_not_zero
> > 			    test_and_set_bit(MMF_OOM_SKIP)
> > 			    					oom_evaluate_task
> > 								   # select next victim 
> > 			  # reap the mm
> >       unmap_vmas
> >
> > so we can select a next victim while the current one is still not
> > completely torn down.
> 
> How does oom_evaluate_task possibly run at the same time as the
> test_and_set_bit in __oom_reap_task_mm, considering both are running
> under the oom_lock?

You are absolutely right. This race is impossible. It was just me
assuming we are going to get rid of the oom_lock because I have that
idea in the back of my head and I would really like to get rid of
it. Global locks are nasty and I would prefer dropping it if we can.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-26 16:29                       ` Andrea Arcangeli
  2017-07-26 16:43                         ` Andrea Arcangeli
@ 2017-07-27  6:50                         ` Michal Hocko
  2017-07-27 14:55                           ` Andrea Arcangeli
  2017-07-28  1:58                         ` [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run kbuild test robot
  2017-08-15  0:20                         ` [PATCH] mm, oom: allow oom reaper to race with exit_mmap David Rientjes
  3 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-07-27  6:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Wed 26-07-17 18:29:12, Andrea Arcangeli wrote:
> On Wed, Jul 26, 2017 at 07:45:57AM +0200, Michal Hocko wrote:
> > On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> > > On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > > > -	down_write(&mm->mmap_sem);
> > > > +	if (tsk_is_oom_victim(current))
> > > > +		down_write(&mm->mmap_sem);
> > > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >  	tlb_finish_mmu(&tlb, 0, -1);
> > > >  
> > > > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> > > >  	}
> > > >  	mm->mmap = NULL;
> > > >  	vm_unacct_memory(nr_accounted);
> > > > -	up_write(&mm->mmap_sem);
> > > > +	if (tsk_is_oom_victim(current))
> > > > +		up_write(&mm->mmap_sem);
> > > 
> > > How is this possibly safe? mark_oom_victim can run while exit_mmap is
> > > running.
> > 
> > I believe it cannot. We always call mark_oom_victim (on !current) with
> > task_lock held and check task->mm != NULL and we call do_exit->mmput after
> > mm is set to NULL under the same lock.
> 
> Holding the mmap_sem for writing, and setting mm->mmap to NULL to
> filter which tasks have already released the mmap_sem for writing after
> free_pgtables, still looks unnecessary to solve this.
> 
> Using MMF_OOM_SKIP as the flag had the side effect of oom_badness()
> skipping it, but we can use the same tsk_is_oom_victim instead and rely
> on the locking in mark_oom_victim you pointed out above instead of the
> test_and_set_bit of my patch, because current->mm is already NULL at
> that point.
> 
> A remaining question, in light of the above: because current->mm is
> NULL by the time mmput is called, how can you start the oom_reap_task
> on a process with current->mm NULL that called the last mmput and is
> blocked in exit_aio?

Because we have that mm available. See tsk->signal->oom_mm in
oom_reap_task.
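
(A sketch of that path, as in the mm/oom_kill.c of that era, trimmed --
the reaper operates on signal->oom_mm, which mark_oom_victim pinned, so
tsk->mm being NULL doesn't stop it:)

	static void oom_reap_task(struct task_struct *tsk)
	{
		int attempts = 0;
		struct mm_struct *mm = tsk->signal->oom_mm;

		/* retry the down_read_trylock(mmap_sem) a few times */
		while (attempts++ < MAX_OOM_REAP_RETRIES &&
		       !__oom_reap_task_mm(tsk, mm))
			schedule_timeout_idle(HZ/10);
		...
		/* hide this mm from the OOM killer */
		set_bit(MMF_OOM_SKIP, &mm->flags);
		...
	}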

> It looks like no false positive can get fixed
> until this is solved first.
> 
> Isn't this enough? If it is, it avoids other modifications to the
> exit_mmap runtime that look unnecessary: mm->mmap = NULL is replaced by
> MMF_OOM_SKIP, which has to be set anyway by __mmput later, and one
> unnecessary branch to call the up_write is avoided.
> 
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..bdab595ce25c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,23 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (tsk_is_oom_victim(current)) {
> +		/*
> +		 * Wait for oom_reap_task() to stop working on this
> +		 * mm. Because MMF_OOM_SKIP is already set before
> +		 * calling down_read(), oom_reap_task() will not run
> +		 * on this "mm" post up_write().
> +		 *
> +		 * tsk_is_oom_victim() cannot be set from under us
> +		 * either because current->mm is already set to NULL
> +		 * under task_lock before calling mmput and oom_mm is
> +		 * set not NULL by the OOM killer only if current->mm
> +		 * is found not NULL while holding the task_lock.
> +		 */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);

Yes this will work and it won't depend on the oom_lock. But isn't it
just more ugly than simply doing

	if (tsk_is_oom_victim(current)) {
		down_write(&mm->mmap_sem);
		locked = true;
	}
	free_pgtables(...)
	[...]
	if (locked)
		up_write(&mm->mmap_sem);

in general I do not like empty locked sections much, to be honest. Now
with the conditional locking my patch looks as follows. It should be
pretty much equivalent to your patch. Would that be acceptable for you
or do you think there is a strong reason to go with yours?
---

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-27  6:50                         ` Michal Hocko
@ 2017-07-27 14:55                           ` Andrea Arcangeli
  2017-07-28  6:23                             ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2017-07-27 14:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Thu, Jul 27, 2017 at 08:50:24AM +0200, Michal Hocko wrote:
> Yes this will work and it won't depend on the oom_lock. But isn't it
> just more ugly than simply doing
> 
> 	if (tsk_is_oom_victim(current)) {
> 		down_write(&mm->mmap_sem);
> 		locked = true;
> 	}
> 	free_pgtables(...)
> 	[...]
> 	if (locked)
> 		up_write(&mm->mmap_sem);

To me, not doing if (tsk_is_oom...) { down_write; up_write } is by
default a confusing implementation, because it's not strict, and
non-strict code is not self-documenting: you have to think twice about
why you're doing something the way you're doing it.

The doubt about what the point was of holding the mmap_sem during
free_pgtables is precisely why I started digging into this issue,
because it didn't look possible that you could truly benefit from
holding the mmap_sem during free_pgtables.

I also don't like having a new invariant that your solution relies on,
that is mm->mmap = NULL, when we can just set MMF_OOM_SKIP a bit
earlier than it gets set anyway and use that to control the other side
of the race.

I like strict code that uses as few invariants as possible and that
never holds a lock for one instruction more than is required (again,
purely for self-documenting reasons; the CPU won't notice much one
instruction more or less).

Even with your patch the two branches are unnecessary, that may not be
measurable, but it's still wasted CPU. It's all about setting mm->mmap
before the up_write. In fact my patch should at least put an incremental
unlikely around my single branch added to exit_mmap.

I see the {down_write;up_write} Hugh's-ksm_exit-like construct as a
strict solution to this issue, and I wrote it specifically while trying
to research a way to be more strict, because from the start it didn't
look like holding the mmap_sem during free_pgtables was necessary.

I'm also fine to drop the oom_lock but I think it can be done
incrementally as it's a separate issue, my second patch should allow
for it with no adverse side effects.

All I care about is the exit_mmap path because it runs too many times
not to pay deep attention to every bit of it ;).

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run
  2017-07-26 16:29                       ` Andrea Arcangeli
  2017-07-26 16:43                         ` Andrea Arcangeli
  2017-07-27  6:50                         ` Michal Hocko
@ 2017-07-28  1:58                         ` kbuild test robot
  2017-08-15  0:20                         ` [PATCH] mm, oom: allow oom reaper to race with exit_mmap David Rientjes
  3 siblings, 0 replies; 36+ messages in thread
From: kbuild test robot @ 2017-07-28  1:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kbuild-all, Michal Hocko, Kirill A. Shutemov, Andrew Morton,
	David Rientjes, Tetsuo Handa, Oleg Nesterov, Hugh Dickins,
	linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 8507 bytes --]

Hi Andrea,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.13-rc2 next-20170727]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/mm-oom-let-oom_reap_task-and-exit_mmap-to-run/20170728-082915
config: x86_64-randconfig-x013-201730 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/linkage.h:4:0,
                    from include/linux/kernel.h:6,
                    from mm/mmap.c:11:
   mm/mmap.c: In function 'exit_mmap':
   mm/mmap.c:2997:6: error: implicit declaration of function 'tsk_is_oom_victim' [-Werror=implicit-function-declaration]
     if (tsk_is_oom_victim(current)) {
         ^
   include/linux/compiler.h:156:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> mm/mmap.c:2997:2: note: in expansion of macro 'if'
     if (tsk_is_oom_victim(current)) {
     ^~
   mm/mmap.c: At top level:
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'strcpy' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:390:2: note: in expansion of macro 'if'
     if (p_size == (size_t)-1 && q_size == (size_t)-1)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'kmemdup' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:380:2: note: in expansion of macro 'if'
     if (p_size < size)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'kmemdup' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:378:2: note: in expansion of macro 'if'
     if (__builtin_constant_p(size) && p_size < size)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memchr_inv' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:369:2: note: in expansion of macro 'if'
     if (p_size < size)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memchr_inv' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:367:2: note: in expansion of macro 'if'
     if (__builtin_constant_p(size) && p_size < size)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memchr' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:358:2: note: in expansion of macro 'if'
     if (p_size < size)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memchr' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:356:2: note: in expansion of macro 'if'
     if (__builtin_constant_p(size) && p_size < size)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcmp' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:348:2: note: in expansion of macro 'if'
     if (p_size < size || q_size < size)
     ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcmp' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:345:3: note: in expansion of macro 'if'
      if (q_size < size)
      ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcmp' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:343:3: note: in expansion of macro 'if'
      if (p_size < size)
      ^~
   include/linux/compiler.h:162:4: warning: '______f' is static but declared in inline function 'memcmp' which is not static
       ______f = {     \
       ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                          ^~~~~~~~~~
   include/linux/string.h:342:2: note: in expansion of macro 'if'

vim +/if +2997 mm/mmap.c

  2963	
  2964	/* Release all mmaps. */
  2965	void exit_mmap(struct mm_struct *mm)
  2966	{
  2967		struct mmu_gather tlb;
  2968		struct vm_area_struct *vma;
  2969		unsigned long nr_accounted = 0;
  2970	
  2971		/* mm's last user has gone, and its about to be pulled down */
  2972		mmu_notifier_release(mm);
  2973	
  2974		if (mm->locked_vm) {
  2975			vma = mm->mmap;
  2976			while (vma) {
  2977				if (vma->vm_flags & VM_LOCKED)
  2978					munlock_vma_pages_all(vma);
  2979				vma = vma->vm_next;
  2980			}
  2981		}
  2982	
  2983		arch_exit_mmap(mm);
  2984	
  2985		vma = mm->mmap;
  2986		if (!vma)	/* Can happen if dup_mmap() received an OOM */
  2987			return;
  2988	
  2989		lru_add_drain();
  2990		flush_cache_mm(mm);
  2991		tlb_gather_mmu(&tlb, mm, 0, -1);
  2992		/* update_hiwater_rss(mm) here? but nobody should be looking */
  2993		/* Use -1 here to ensure all VMAs in the mm are unmapped */
  2994		unmap_vmas(&tlb, vma, 0, -1);
  2995	
  2996		set_bit(MMF_OOM_SKIP, &mm->flags);
> 2997		if (tsk_is_oom_victim(current)) {
  2998			/*
  2999			 * Wait for oom_reap_task() to stop working on this
  3000			 * mm. Because MMF_OOM_SKIP is already set before
  3001			 * calling down_read(), oom_reap_task() will not run
  3002			 * on this "mm" post up_write().
  3003			 *
  3004			 * tsk_is_oom_victim() cannot be set from under us
  3005			 * either because current->mm is already set to NULL
  3006			 * under task_lock before calling mmput and oom_mm is
  3007			 * set not NULL by the OOM killer only if current->mm
  3008			 * is found not NULL while holding the task_lock.
  3009			 */
  3010			down_write(&mm->mmap_sem);
  3011			up_write(&mm->mmap_sem);
  3012		}
  3013		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
  3014		tlb_finish_mmu(&tlb, 0, -1);
  3015	
  3016		/*
  3017		 * Walk the list again, actually closing and freeing it,
  3018		 * with preemption enabled, without holding any MM locks.
  3019		 */
  3020		while (vma) {
  3021			if (vma->vm_flags & VM_ACCOUNT)
  3022				nr_accounted += vma_pages(vma);
  3023			vma = remove_vma(vma);
  3024		}
  3025		vm_unacct_memory(nr_accounted);
  3026	}
  3027	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35408 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-27 14:55                           ` Andrea Arcangeli
@ 2017-07-28  6:23                             ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2017-07-28  6:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Thu 27-07-17 16:55:59, Andrea Arcangeli wrote:
> On Thu, Jul 27, 2017 at 08:50:24AM +0200, Michal Hocko wrote:
> > Yes this will work and it won't depend on the oom_lock. But isn't it
> > just more ugly than simply doing
> > 
> > 	if (tsk_is_oom_victim(current)) {
> > 		down_write(&mm->mmap_sem);
> > 		locked = true;
> > 	}
> > 	free_pgtables(...)
> > 	[...]
> > 	if (locked)
> > 		up_write(&mm->mmap_sem);
> 
> To me not doing if (tsk_is_oom...) { down_write; up_write } is by
> default a confusing implementation, because it's not strict and not
> strict code is not self documenting and you've to think twice of why
> you're doing something the way you're doing it.

I disagree. But this is a matter of taste, I guess. I prefer when
locking is explicit and clear about its scope. An empty locked region
used for synchronization is just obscure, and you have to think much
harder about what it means than when you can see what the lock actually
protects. It is also less error prone, I believe.

> The doubt on what was the point to hold the mmap_sem during
> free_pgtables is precisely why I started digging into this issue
> because it didn't look possible you could truly benefit from holding
> the mmap_sem during free_pgtables.

The point is that you cannot remove page tables while somebody is
walking them. This is enforced in all other paths except for exit which
was kind of special because nobody could race there.
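
(Compare the regular munmap path, which only tears down page tables
with the mmap_sem held for writing -- a sketch of vm_munmap from the
mm/mmap.c of that era:)

	int vm_munmap(unsigned long start, size_t len)
	{
		int ret;
		struct mm_struct *mm = current->mm;
		LIST_HEAD(uf);

		if (down_write_killable(&mm->mmap_sem))
			return -EINTR;

		/* do_munmap -> unmap_region -> free_pgtables */
		ret = do_munmap(mm, start, len, &uf);
		up_write(&mm->mmap_sem);
		userfaultfd_unmap_complete(mm, &uf);
		return ret;
	}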

> I also don't like having a new invariant that your solution relies on,
> that is mm->mmap = NULL, when we can just set MMF_OOM_SKIP a bit
> earlier than it gets set anyway and use that to control the other side
> of the race.

Well, again a matter of taste. I prefer making it clear that mm->mmap
is no longer valid because it would point to freed memory. Relying on
MMF_OOM_SKIP for the handshake is just abusing the flag for a different
purpose than it was intended for.
 
> I like strict code that uses as few invariants as possible and that
> never holds a lock for one instruction more than is required (again,
> purely for self-documenting reasons; the CPU won't notice much one
> instruction more or less).
> 
> Even with your patch the two branches are unnecessary, that may not be
> measurable, but it's still wasted CPU. It's all about setting mm->mmap
> before the up_write. In fact my patch should at least put an incremental
> unlikely around my single branch added to exit_mmap.
>
> I see the {down_write;up_write} Hugh's-ksm_exit-like construct as a
> strict solution to this issue, and I wrote it specifically while trying
> to research a way to be more strict, because from the start it didn't
> look like holding the mmap_sem during free_pgtables was necessary.

When we discussed this with Hugh, he showed an interest in actually
getting rid of this (peculiar) construct, which would be possible if
the down_write were implicit in exit_mmap [1].

> I'm also fine to drop the oom_lock but I think it can be done
> incrementally as it's a separate issue, my second patch should allow
> for it with no adverse side effects.
>
> All I care about is the exit_mmap path because it runs too many times
> not to pay deep attention to every bit of it ;).

Well, all I care about is to fix this over-eager oom killing due to oom
reaper setting MMF_OOM_SKIP too early. That is a regression. I think we
can agree that both proposed solutions are functionally equivalent. I do
not want to push mine too hard so if you _believe_ that yours is really
better feel free to post it for inclusion.

[1] http://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-07-26 16:29                       ` Andrea Arcangeli
                                           ` (2 preceding siblings ...)
  2017-07-28  1:58                         ` [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run kbuild test robot
@ 2017-08-15  0:20                         ` David Rientjes
  3 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2017-08-15  0:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Tetsuo Handa,
	Oleg Nesterov, Hugh Dickins, linux-mm, LKML

On Wed, 26 Jul 2017, Andrea Arcangeli wrote:

> From 3d9001490ee1a71f39c7bfaf19e96821f9d3ff16 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Tue, 25 Jul 2017 20:02:27 +0200
> Subject: [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run
>  concurrently
> 
> This is purely required because exit_aio() may block and exit_mmap() may
> never start, if the oom_reap_task cannot start running on a mm with
> mm_users == 0.
> 
> At the same time if the OOM reaper doesn't wait at all for the memory
> of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it
> would generate a spurious OOM kill.
> 
> If it wasn't because of the exit_aio or similar blocking functions in
> the last mmput, it would be enough to change the oom_reap_task() in
> the case it finds mm_users == 0, to wait for a timeout or to wait for
> __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
> problem here so the concurrency of exit_mmap and oom_reap_task is
> apparently warranted.
> 
> It's a non standard runtime, exit_mmap() runs without mmap_sem, and
> oom_reap_task runs with the mmap_sem for reading as usual (kind of
> MADV_DONTNEED).
> 
> The race between the two is solved with a combination of
> tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
> (serialized by a dummy down_write/up_write cycle on the same lines of
> the ksm_exit method).
> 
> If the oom_reap_task() may be running concurrently during exit_mmap,
> exit_mmap will wait it to finish in down_write (before taking down mm
> structures that would make the oom_reap_task fail with use after
> free).
> 
> If exit_mmap comes first, oom_reap_task() will skip the mm if
> MMF_OOM_SKIP is already set and in turn all memory is already freed
> and furthermore the mm data structures may already have been taken
> down by free_pgtables.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

With your follow-up one liner to include linux/oom.h folded in:

Tested-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-08-10 18:51   ` Michal Hocko
@ 2017-08-10 20:36     ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2017-08-10 20:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrea Argangeli, Hugh Dickins, Kirill A. Shutemov, linux-mm,
	LKML

On Thu 10-08-17 20:51:38, Michal Hocko wrote:
[...]
> OK, let's agree to disagree. As I've said, I like it when the critical
> section is explicit and we _know_ what it protects. In this case it is
> clear that we have to protect against the page table teardown and the
> vma destruction. But as I've said, I am not going to argue about this
> any more. It is more important to finally fix this.

Now that I've reread it, it may sound different than I intended. I meant
to say that I will not argue about which solution is better; both
patches are good to go. I will let others decide, but I would be glad
if we finally went with something.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-08-10 18:05 ` Andrea Arcangeli
@ 2017-08-10 18:51   ` Michal Hocko
  2017-08-10 20:36     ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-08-10 18:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrea Argangeli, Hugh Dickins, Kirill A. Shutemov, linux-mm,
	LKML

On Thu 10-08-17 20:05:54, Andrea Arcangeli wrote:
> On Thu, Aug 10, 2017 at 10:16:32AM +0200, Michal Hocko wrote:
> > Andrea has proposed an alternative solution [4] which should be
> > functionally equivalent, similar to {ksm,khugepaged}_exit. I have to
> > confess I really don't like that approach but I can live with it if
> > that is the preferred way (to be honest I would like to drop the empty
> 
> Well you added two branches when only one is necessary. It's more or
> less like preferring an rwsem when a mutex is enough, because you're
> more used to using rwsems.
> 
> > down_write();up_write() from the other two callers as well). In fact I
> > have asked Andrea to post his patch [5] but that hasn't happened. I do
> > not think we should wait much longer; we should finally merge some fix.
> 
> It's already posted in [4] below; I didn't think it was necessary to
> resend it.

It was deep in the email thread, and I asked you explicitly to repost
it; I reposted mine for the same reason.

> The only other improvement I can think of is an unlikely
> around tsk_is_oom_victim() in exit_mmap, but your patch below would
> need it too, and two of them at that.

with
diff --git a/mm/mmap.c b/mm/mmap.c
index 822e8860b9d2..9d4a5a488f72 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3002,7 +3002,7 @@ void exit_mmap(struct mm_struct *mm)
 	 * with tsk->mm != NULL checked on !current tasks which synchronizes
 	 * with exit_mm and so we cannot race here.
 	 */
-	if (tsk_is_oom_victim(current)) {
+	if (unlikely(tsk_is_oom_victim(current))) {
 		down_write(&mm->mmap_sem);
 		locked = true;
 	}
@@ -3020,7 +3020,7 @@ void exit_mmap(struct mm_struct *mm)
 	}
 	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
-	if (locked)
+	if (unlikely(locked))
 		up_write(&mm->mmap_sem);
 }
 
The generated code is identical. But of course I do not have any
objection.
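
As a side note, unlikely() is only a static-prediction hint to the
compiler, which is why the generated code can come out identical when
the compiler already guesses the branch the same way. The standard
kernel definitions, from include/linux/compiler.h:

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)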

> > [1] http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@kernel.org
> > [2] http://lkml.kernel.org/r/20170725142626.GJ26723@dhcp22.suse.cz
> > [3] http://lkml.kernel.org/r/20170725160359.GO26723@dhcp22.suse.cz
> > [4] http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
> > [5] http://lkml.kernel.org/r/20170728062345.GA2274@dhcp22.suse.cz
> > 
> > +	if (tsk_is_oom_victim(current)) {
> > +		down_write(&mm->mmap_sem);
> > +		locked = true;
> > +	}
> >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  
> > @@ -3005,7 +3018,10 @@ void exit_mmap(struct mm_struct *mm)
> >  			nr_accounted += vma_pages(vma);
> >  		vma = remove_vma(vma);
> >  	}
> > +	mm->mmap = NULL;
> >  	vm_unacct_memory(nr_accounted);
> > +	if (locked)
> > +		up_write(&mm->mmap_sem);
> 
> I wouldn't normally repost just to add an unlikely when I'm not sure
> the patch will get merged, but if it does get merged I would
> immediately tell Andrew about the missing micro-optimization so he
> can fold it in later.
> 
> Before reposting about the unlikely I thought we should agree on which
> version to merge: [4] or the above double branch (for no good reason,
> as far as I can tangibly tell).
> 
> I think down_write;up_write is the correct thing to do here because
> holding the lock for any additional instruction has zero benefit, and
> if it has zero benefit it only adds confusion and makes the code
> partly senseless, which ultimately hurts readers when they try to
> understand why you're holding the lock for so long when it's not
> needed.

OK, let's agree to disagree. As I've said, I like it when the critical
section is explicit and we _know_ what it protects. In this case it is
clear that we have to protect against the page table teardown and the
vma destruction. But as I've said, I am not going to argue about this
any more. It is more important to finally fix this.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
  2017-08-10  8:16 Michal Hocko
@ 2017-08-10 18:05 ` Andrea Arcangeli
  2017-08-10 18:51   ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2017-08-10 18:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, Oleg Nesterov,
	Andrea Argangeli, Hugh Dickins, Kirill A. Shutemov, linux-mm,
	LKML, Michal Hocko

On Thu, Aug 10, 2017 at 10:16:32AM +0200, Michal Hocko wrote:
> Andrea has proposed an alternative solution [4] which should be
> functionally equivalent, similar to {ksm,khugepaged}_exit. I have to
> confess I really don't like that approach but I can live with it if
> that is the preferred way (to be honest I would like to drop the empty

Well you added two branches when only one is necessary. It's more or
less like preferring an rwsem when a mutex is enough, because you're
more used to using rwsems.

> down_write();up_write() from the other two callers as well). In fact I
> have asked Andrea to post his patch [5] but that hasn't happened. I do
> not think we should wait much longer; we should finally merge some fix.

It's already posted in [4] below; I didn't think it was necessary to
resend it. The only other improvement I can think of is an unlikely
around tsk_is_oom_victim() in exit_mmap, but your patch below would
need it too, and two of them at that.

> [1] http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@kernel.org
> [2] http://lkml.kernel.org/r/20170725142626.GJ26723@dhcp22.suse.cz
> [3] http://lkml.kernel.org/r/20170725160359.GO26723@dhcp22.suse.cz
> [4] http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
> [5] http://lkml.kernel.org/r/20170728062345.GA2274@dhcp22.suse.cz
> 
> +	if (tsk_is_oom_victim(current)) {
> +		down_write(&mm->mmap_sem);
> +		locked = true;
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  
> @@ -3005,7 +3018,10 @@ void exit_mmap(struct mm_struct *mm)
>  			nr_accounted += vma_pages(vma);
>  		vma = remove_vma(vma);
>  	}
> +	mm->mmap = NULL;
>  	vm_unacct_memory(nr_accounted);
> +	if (locked)
> +		up_write(&mm->mmap_sem);

I wouldn't normally repost just to add an unlikely when I'm not sure
the patch will get merged, but if it does get merged I would
immediately tell Andrew about the missing micro-optimization so he can
fold it in later.

Before reposting about the unlikely I thought we should agree on which
version to merge: [4] or the above double branch (for no good reason,
as far as I can tangibly tell).

I think down_write;up_write is the correct thing to do here because
holding the lock for any additional instruction has zero benefit, and
if it has zero benefit it only adds confusion and makes the code
partly senseless, which ultimately hurts readers when they try to
understand why you're holding the lock for so long when it's not
needed.
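
The empty lock/unlock pair being argued for here works as a barrier,
on the same lines as ksm_exit. A sketch of the idiom, assuming (as in
Andrea's variant) that the reaper re-checks MMF_OOM_SKIP under the
read lock:

	/* oom reaper side */
	if (!down_read_trylock(&mm->mmap_sem))
		return false;
	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
		/* exit_mmap already freed everything */
		up_read(&mm->mmap_sem);
		return true;
	}
	/* ... reap, then up_read() ... */

	/* exit_mmap side */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	/*
	 * After this empty cycle no reaper that missed the bit above
	 * can still be inside its read-side critical section, so it
	 * is safe to free_pgtables() and destroy vmas.
	 */
	down_write(&mm->mmap_sem);
	up_write(&mm->mmap_sem);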

I just read other code yesterday, for another bug about the rss going
off by one in some older kernel, that calls add_mm_rss_vec(mm, rss)
(where rss is on the stack and mm->rss_stat is mm-global) under the PT
lock, and again I had to ask myself why it is done there, and whether
the PT lock could possibly help. My evaluation was no, it's just done
in the wrong place, but then I'm not 100% sure because there's a
chance I misread something very subtle.

	add_mm_rss_vec(mm, rss);
	arch_leave_lazy_mmu_mode();

	/* Do the actual TLB flush before dropping ptl */
	if (force_flush)
		tlb_flush_mmu_tlbonly(tlb);
	pte_unmap_unlock(start_pte, ptl);

The tlb flushing doesn't seem to check the stats either, so why isn't
add_mm_rss_vec called after pte_unmap_unlock?
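
For context, the helper in question looks roughly like this in kernels
of that era (mm/memory.c; quoted from memory, so treat the details as
approximate). It folds the on-stack counters into the mm-wide ones,
which is independent of the TLB flush itself:

	static void add_mm_rss_vec(struct mm_struct *mm, int *rss)
	{
		int i;

		/* fold current's cached counters first, if this is our mm */
		if (current->mm == mm)
			sync_mm_rss(mm);
		/* then add the batched on-stack deltas to mm->rss_stat */
		for (i = 0; i < NR_MM_COUNTERS; i++)
			if (rss[i])
				add_mm_counter(mm, i, rss[i]);
	}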

And yes, it looks off-topic (and there's no bug in the rss accounting;
I was just reading around it just in case), but it's not: it's
precisely the kind of issue I have with your patch, because it'll
introduce another case like the above where I can't explain why
something is done under a lock when it doesn't need it, and it's hard
to guess it was just your dislike for down_write;up_write that made
you choose to hold the lock for no good reason, arbitrarily a bit
longer.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] mm, oom: allow oom reaper to race with exit_mmap
@ 2017-08-10  8:16 Michal Hocko
  2017-08-10 18:05 ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2017-08-10  8:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Andrea Argangeli,
	Hugh Dickins, Kirill A. Shutemov, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

David has noticed that the oom killer might kill additional tasks while
the exiting oom victim hasn't terminated yet, because the oom_reaper
marks the current victim MMF_OOM_SKIP too early, when mm->mm_users has
dropped down to 0. The race is as follows:

oom_reap_task				do_exit
					  exit_mm
  __oom_reap_task_mm
					    mmput
					      __mmput
    mmget_not_zero # fails
    						exit_mmap # frees memory
  set_bit(MMF_OOM_SKIP)

The victim is still visible to the OOM killer until it is unhashed.

Currently we try to reduce the risk of this race by taking the oom_lock
and waiting for out_of_memory to sleep while holding the lock, to give
the victim some time to exit. This is quite a suboptimal approach
because there is no guarantee the victim (especially a large one) will
manage to unmap its address space and free enough memory to the
particular oom domain which needs memory (e.g. a specific NUMA node).

Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
to run in parallel with other unmappers (hence the mmap_sem for read).

The only tricky part is to exclude the page table teardown and all
operations which modify the address space in the __mmput path. exit_mmap
doesn't expect any other users, so it doesn't use any locking. Nothing
really forbids us from using mmap_sem for write, though. In fact we are
already relying on this lock earlier in the __mmput path to synchronize
with ksm and khugepaged.

Take the exclusive mmap_sem when calling free_pgtables and destroying
vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
other operations can safely race with the parallel unmap.

Changes since v1
- bail on null mm->mmap early as per David Rientjes
- take exclusive mmap_sem in exit_mmap only for oom victims to reduce
  the lock overhead

Reported-by: David Rientjes <rientjes@google.com>
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
the previous version of this patch was posted here [1]. The original
patch took mmap_sem in exit_mmap unconditionally, but Kirill was
worried this could have a performance impact (we should exercise the
fast path most of the time because nobody should be holding the lock at
that stage). An artificial testcase [2] has shown a ~3% difference, but
the numbers are quite noisy [3], so the effect is not all that clear.
Anyway, I have made the lock conditional for oom victims.

Andrea has proposed an alternative solution [4] which should be
functionally equivalent, similar to {ksm,khugepaged}_exit. I have to
confess I really don't like that approach but I can live with it if
that is the preferred way (to be honest I would like to drop the empty
down_write();up_write() from the other two callers as well). In fact I
have asked Andrea to post his patch [5] but that hasn't happened. I do
not think we should wait much longer; we should finally merge some fix.

[1] http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20170725142626.GJ26723@dhcp22.suse.cz
[3] http://lkml.kernel.org/r/20170725160359.GO26723@dhcp22.suse.cz
[4] http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
[5] http://lkml.kernel.org/r/20170728062345.GA2274@dhcp22.suse.cz

 mm/mmap.c     | 16 ++++++++++++++++
 mm/oom_kill.c | 47 ++++++++---------------------------------------
 2 files changed, 24 insertions(+), 39 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 24e9261bdcc0..822e8860b9d2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -44,6 +44,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
+#include <linux/oom.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2967,6 +2968,7 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool locked = false;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -2993,6 +2995,17 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables or unmap VMAs under its feet
+	 * Please note that mark_oom_victim is always called under task_lock
+	 * with tsk->mm != NULL checked on !current tasks which synchronizes
+	 * with exit_mm and so we cannot race here.
+	 */
+	if (tsk_is_oom_victim(current)) {
+		down_write(&mm->mmap_sem);
+		locked = true;
+	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3005,7 +3018,10 @@ void exit_mmap(struct mm_struct *mm)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
+	if (locked)
+		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..b1c96e1910f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,40 +470,15 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		return false;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
+	/* There is nothing to reap so bail out without signs in the log */
+	if (!mm->mmap)
+		goto unlock;
 
 	trace_start_task_reaping(tsk->pid);
 
@@ -540,18 +515,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-	up_read(&mm->mmap_sem);
 
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+unlock:
+	up_read(&mm->mmap_sem);
+
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-- 
2.13.2
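
To see how the two sides of this patch pair up, here is a condensed
view assembled from the hunks above (not a literal quote of either
function):

	/* exit_mmap(), with the patch applied */
	if (tsk_is_oom_victim(current)) {
		down_write(&mm->mmap_sem);	/* excludes the reaper */
		locked = true;
	}
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	/* ... remove_vma() loop ... */
	mm->mmap = NULL;		/* the reaper's bail-out condition */
	vm_unacct_memory(nr_accounted);
	if (locked)
		up_write(&mm->mmap_sem);

	/* __oom_reap_task_mm(), with the patch applied */
	if (!down_read_trylock(&mm->mmap_sem))
		return false;
	/* There is nothing to reap so bail out without signs in the log */
	if (!mm->mmap)
		goto unlock;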


^ permalink raw reply related	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2017-08-15  0:21 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-24  7:23 [PATCH] mm, oom: allow oom reaper to race with exit_mmap Michal Hocko
2017-07-24 14:00 ` Kirill A. Shutemov
2017-07-24 14:15   ` Michal Hocko
2017-07-24 14:51     ` Kirill A. Shutemov
2017-07-24 16:11       ` Michal Hocko
2017-07-25 14:17         ` Kirill A. Shutemov
2017-07-25 14:26           ` Michal Hocko
2017-07-25 15:07             ` Kirill A. Shutemov
2017-07-25 15:15               ` Michal Hocko
2017-07-25 14:26         ` Michal Hocko
2017-07-25 15:17           ` Kirill A. Shutemov
2017-07-25 15:23             ` Michal Hocko
2017-07-25 15:31               ` Kirill A. Shutemov
2017-07-25 16:04                 ` Michal Hocko
2017-07-25 19:19                   ` Andrea Arcangeli
2017-07-26  5:45                     ` Michal Hocko
2017-07-26 16:29                       ` Andrea Arcangeli
2017-07-26 16:43                         ` Andrea Arcangeli
2017-07-27  6:50                         ` Michal Hocko
2017-07-27 14:55                           ` Andrea Arcangeli
2017-07-28  6:23                             ` Michal Hocko
2017-07-28  1:58                         ` [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run kbuild test robot
2017-08-15  0:20                         ` [PATCH] mm, oom: allow oom reaper to race with exit_mmap David Rientjes
2017-07-24 15:27 ` Michal Hocko
2017-07-24 16:42 ` kbuild test robot
2017-07-24 18:12   ` Michal Hocko
2017-07-25 15:26 ` Andrea Arcangeli
2017-07-25 15:45   ` Michal Hocko
2017-07-25 18:26     ` Andrea Arcangeli
2017-07-26  5:45       ` Michal Hocko
2017-07-26 16:39         ` Andrea Arcangeli
2017-07-27  6:32           ` Michal Hocko
2017-08-10  8:16 Michal Hocko
2017-08-10 18:05 ` Andrea Arcangeli
2017-08-10 18:51   ` Michal Hocko
2017-08-10 20:36     ` Michal Hocko

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).