* [PATCH v2 1/2] mm: drop oom code from exit_mmap
@ 2022-05-16  7:56 Suren Baghdasaryan
  2022-05-16  7:56 ` [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-16  7:56 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, kernel-team, surenb

The primary reason to invoke the oom reaper from the exit_mmap path used
to be to prevent excessive oom killing when the exit of the oom victim
races with the oom reaper (see [1] for more details). The invocation has
moved around since then because of the interaction with the munlock
logic, but the underlying reason has remained the same (see [2]).

The munlock code is no longer a problem since [3], and there shouldn't be
any blocking operation before the memory is unmapped by exit_mmap, so the
oom reaper invocation can be dropped. The unmapping part can be done with
the non-exclusive mmap_lock; the exclusive lock is only required when page
tables are freed.

Remove the oom_reaper invocation from exit_mmap, which will make the code
easier to read. This is unlikely to make any observable difference,
although some microbenchmarks could benefit from one less branch that
needs to be evaluated even though it is almost never true.

[1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  2 --
 mm/mmap.c           | 31 ++++++++++++-------------------
 mm/oom_kill.c       |  2 +-
 3 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..6cdf0772dbae 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
-
 long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 313b57d55a63..ded42150e706 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_lock for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_lock is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_lock here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 */
-		(void)__oom_reap_task_mm(mm);
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-	}
-
-	mmap_write_lock(mm);
+	mmap_read_lock(mm);
 	arch_exit_mmap(mm);
 
 	vma = mm->mmap;
 	if (!vma) {
 		/* Can happen if dup_mmap() received an OOM */
-		mmap_write_unlock(mm);
+		mmap_read_unlock(mm);
 		return;
 	}
 
@@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
+	mmap_read_unlock(mm);
+
+	/*
+	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
+	 * because the memory has been already freed. Do not bother checking
+	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
+	 */
+	set_bit(MMF_OOM_SKIP, &mm->flags);
+
+	mmap_write_lock(mm);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 49d7df39b02d..36355b162727 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -509,7 +509,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-- 
2.36.0.550.gb090851708-goog



* [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-05-16  7:56 [PATCH v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
@ 2022-05-16  7:56 ` Suren Baghdasaryan
  2022-05-17  3:13   ` Shuah Khan
  2022-05-17  3:12 ` [PATCH v2 1/2] mm: drop oom code from exit_mmap Shuah Khan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-16  7:56 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, kernel-team, surenb

With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is
now unused and can be removed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h            | 9 ---------
 include/linux/sched/coredump.h | 7 +++----
 mm/oom_kill.c                  | 4 +---
 3 files changed, 4 insertions(+), 16 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6cdf0772dbae..25990e9d9e15 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,15 +77,6 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
-/*
- * Use this helper if tsk->mm != mm and the victim mm needs a special
- * handling. This is guaranteed to stay true after once set.
- */
-static inline bool mm_is_oom_victim(struct mm_struct *mm)
-{
-	return test_bit(MMF_OOM_VICTIM, &mm->flags);
-}
-
 /*
  * Checks whether a page fault on the given mm is still reliable.
  * This is no longer true if the oom reaper started to reap the
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d9e3a656875..5318c791ad39 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -70,9 +70,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
-#define MMF_OOM_VICTIM		25	/* mm is the oom victim */
-#define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
-#define MMF_MULTIPROCESS	27	/* mm is shared between processes */
+#define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
+#define MMF_MULTIPROCESS	26	/* mm is shared between processes */
 /*
  * MMF_HAS_PINNED: Whether this mm has pinned any pages.  This can be either
  * replaced in the future by mm.pinned_vm when it becomes stable, or grow into
@@ -80,7 +79,7 @@ static inline int get_dumpable(struct mm_struct *mm)
  * pinned pages were unpinned later on, we'll still keep this bit set for the
  * lifecycle of this mm, just for simplicity.
  */
-#define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
+#define MMF_HAS_PINNED		27	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 36355b162727..11291b99599f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -732,10 +732,8 @@ static void mark_oom_victim(struct task_struct *tsk)
 		return;
 
 	/* oom_mm is bound to the signal struct life time. */
-	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		mmgrab(tsk->signal->oom_mm);
-		set_bit(MMF_OOM_VICTIM, &mm->flags);
-	}
 
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
-- 
2.36.0.550.gb090851708-goog



* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-16  7:56 [PATCH v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
  2022-05-16  7:56 ` [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
@ 2022-05-17  3:12 ` Shuah Khan
  2022-05-19 19:29 ` Andrew Morton
  2022-05-19 20:22 ` Liam Howlett
  3 siblings, 0 replies; 16+ messages in thread
From: Shuah Khan @ 2022-05-17  3:12 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, kernel-team, Shuah Khan

On 5/16/22 1:56 AM, Suren Baghdasaryan wrote:
> The primary reason to invoke the oom reaper from the exit_mmap path used
> to be a prevention of an excessive oom killing if the oom victim exit
> races with the oom reaper (see [1] for more details). The invocation has
> moved around since then because of the interaction with the munlock
> logic but the underlying reason has remained the same (see [2]).
> 
> Munlock code is no longer a problem since [3] and there shouldn't be
> any blocking operation before the memory is unmapped by exit_mmap so
> the oom reaper invocation can be dropped. The unmapping part can be done
> with the non-exclusive mmap_sem and the exclusive one is only required
> when page tables are freed.
> 
> Remove the oom_reaper from exit_mmap which will make the code easier to
> read. This is really unlikely to make any observable difference although
> some microbenchmarks could benefit from one less branch that needs to be
> evaluated even though it almost never is true.
> 
> [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
Looks good to me.

Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>

thanks,
-- Shuah


* Re: [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-05-16  7:56 ` [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
@ 2022-05-17  3:13   ` Shuah Khan
  0 siblings, 0 replies; 16+ messages in thread
From: Shuah Khan @ 2022-05-17  3:13 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, kernel-team, Shuah Khan

On 5/16/22 1:56 AM, Suren Baghdasaryan wrote:
> With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is
> now unused and can be removed.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>   include/linux/oom.h            | 9 ---------
>   include/linux/sched/coredump.h | 7 +++----
>   mm/oom_kill.c                  | 4 +---
>   3 files changed, 4 insertions(+), 16 deletions(-)
> 
Looks good to me.

Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>

thanks,
-- Shuah


* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-16  7:56 [PATCH v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
  2022-05-16  7:56 ` [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
  2022-05-17  3:12 ` [PATCH v2 1/2] mm: drop oom code from exit_mmap Shuah Khan
@ 2022-05-19 19:29 ` Andrew Morton
  2022-05-19 20:33   ` Liam Howlett
  2022-05-19 20:22 ` Liam Howlett
  3 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2022-05-19 19:29 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, kernel-team, Liam Howlett

On Mon, 16 May 2022 00:56:18 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> The primary reason to invoke the oom reaper from the exit_mmap path used
> to be a prevention of an excessive oom killing if the oom victim exit
> races with the oom reaper (see [1] for more details). The invocation has
> moved around since then because of the interaction with the munlock
> logic but the underlying reason has remained the same (see [2]).
> 
> Munlock code is no longer a problem since [3] and there shouldn't be
> any blocking operation before the memory is unmapped by exit_mmap so
> the oom reaper invocation can be dropped. The unmapping part can be done
> with the non-exclusive mmap_sem and the exclusive one is only required
> when page tables are freed.
> 
> Remove the oom_reaper from exit_mmap which will make the code easier to
> read. This is really unlikely to make any observable difference although
> some microbenchmarks could benefit from one less branch that needs to be
> evaluated even though it almost never is true.
> 

Liam, this mucks "mm: start tracking VMAs with maple tree" somewhat.

> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -bool __oom_reap_task_mm(struct mm_struct *mm);
> -
>  long oom_badness(struct task_struct *p,
>  		unsigned long totalpages);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 313b57d55a63..ded42150e706 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> -	if (unlikely(mm_is_oom_victim(mm))) {
> -		/*
> -		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration.  Taking mm->mmap_lock for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_lock is
> -		 * dropped.
> -		 *
> -		 * Nothing can be holding mm->mmap_lock here and the above call
> -		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> -		 * __oom_reap_task_mm() will not block.
> -		 */
> -		(void)__oom_reap_task_mm(mm);
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> -	}
> -
> -	mmap_write_lock(mm);
> +	mmap_read_lock(mm);
>  	arch_exit_mmap(mm);
>  
>  	vma = mm->mmap;
>  	if (!vma) {
>  		/* Can happen if dup_mmap() received an OOM */
> -		mmap_write_unlock(mm);
> +		mmap_read_unlock(mm);
>  		return;
>  	}
>  
> @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
> +	mmap_read_unlock(mm);
> +
> +	/*
> +	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> +	 * because the memory has been already freed. Do not bother checking
> +	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> +	 */
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
> +
> +	mmap_write_lock(mm);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb);
>  

I ended up with the below rework of "mm: start tracking VMAs with maple
tree".  Please triple check?

void exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;
	unsigned long nr_accounted = 0;

	/* mm's last user has gone, and its about to be pulled down */
	mmu_notifier_release(mm);

	mmap_write_lock(mm);
	arch_exit_mmap(mm);
	vma = mm->mmap;
	if (!vma) {
		/* Can happen if dup_mmap() received an OOM */
		mmap_write_unlock(mm);
		return;
	}

	lru_add_drain();
	flush_cache_mm(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	/* update_hiwater_rss(mm) here? but nobody should be looking */
	/* Use -1 here to ensure all VMAs in the mm are unmapped */
	unmap_vmas(&tlb, vma, 0, -1);

	/*
	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
	 * because the memory has been already freed. Do not bother checking
	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);

	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	/* Walk the list again, actually closing and freeing it. */
	while (vma) {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		vma = remove_vma(vma);
		cond_resched();
	}

	trace_exit_mmap(mm);
	__mt_destroy(&mm->mm_mt);
	mmap_write_unlock(mm);
	vm_unacct_memory(nr_accounted);
}


And "mm: remove the vma linked list" needed further reworking.  I ended
up with

void exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;
	unsigned long nr_accounted = 0;
	MA_STATE(mas, &mm->mm_mt, 0, 0);
	int count = 0;

	/* mm's last user has gone, and its about to be pulled down */
	mmu_notifier_release(mm);

	mmap_write_lock(mm);
	arch_exit_mmap(mm);
	vma = mas_find(&mas, ULONG_MAX);
	if (!vma) {
		/* Can happen if dup_mmap() received an OOM */
		mmap_write_unlock(mm);
		return;
	}

	lru_add_drain();
	flush_cache_mm(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	/* update_hiwater_rss(mm) here? but nobody should be looking */
	/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);

	/*
	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
	 * because the memory has been already freed. Do not bother checking
	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);

	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	/*
	 * Walk the list again, actually closing and freeing it, with preemption
	 * enabled, without holding any MM locks besides the unreachable
	 * mmap_write_lock.
	 */
	do {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		remove_vma(vma);
		count++;
		cond_resched();
	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);

	BUG_ON(count != mm->map_count);

	trace_exit_mmap(mm);
	__mt_destroy(&mm->mm_mt);
	mmap_write_unlock(mm);
	vm_unacct_memory(nr_accounted);
}


The mapletree patches remain hidden from mm.git until, I expect, next week.

Thanks.




* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-16  7:56 [PATCH v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2022-05-19 19:29 ` Andrew Morton
@ 2022-05-19 20:22 ` Liam Howlett
  2022-05-19 21:33   ` Suren Baghdasaryan
  3 siblings, 1 reply; 16+ messages in thread
From: Liam Howlett @ 2022-05-19 20:22 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, rientjes, willy, hannes, guro, minchan, kirill,
	aarcange, brauner, hch, oleg, david, jannh, shakeelb, peterx,
	jhubbard, shuah, linux-kernel, linux-mm, kernel-team

* Suren Baghdasaryan <surenb@google.com> [220516 03:56]:
> The primary reason to invoke the oom reaper from the exit_mmap path used
> to be a prevention of an excessive oom killing if the oom victim exit
> races with the oom reaper (see [1] for more details). The invocation has
> moved around since then because of the interaction with the munlock
> logic but the underlying reason has remained the same (see [2]).
> 
> Munlock code is no longer a problem since [3] and there shouldn't be
> any blocking operation before the memory is unmapped by exit_mmap so
> the oom reaper invocation can be dropped. The unmapping part can be done
> with the non-exclusive mmap_sem and the exclusive one is only required
> when page tables are freed.
> 
> Remove the oom_reaper from exit_mmap which will make the code easier to
> read. This is really unlikely to make any observable difference although
> some microbenchmarks could benefit from one less branch that needs to be
> evaluated even though it almost never is true.
> 
> [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/oom.h |  2 --
>  mm/mmap.c           | 31 ++++++++++++-------------------
>  mm/oom_kill.c       |  2 +-
>  3 files changed, 13 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 2db9a1432511..6cdf0772dbae 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -bool __oom_reap_task_mm(struct mm_struct *mm);
> -
>  long oom_badness(struct task_struct *p,
>  		unsigned long totalpages);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 313b57d55a63..ded42150e706 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> -	if (unlikely(mm_is_oom_victim(mm))) {
> -		/*
> -		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration.  Taking mm->mmap_lock for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_lock is
> -		 * dropped.
> -		 *
> -		 * Nothing can be holding mm->mmap_lock here and the above call
> -		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> -		 * __oom_reap_task_mm() will not block.
> -		 */
> -		(void)__oom_reap_task_mm(mm);
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> -	}
> -
> -	mmap_write_lock(mm);
> +	mmap_read_lock(mm);
>  	arch_exit_mmap(mm);

arch_exit_mmap() was called under the write lock before, is it safe to
call it under the read lock?

>  
>  	vma = mm->mmap;
>  	if (!vma) {
>  		/* Can happen if dup_mmap() received an OOM */
> -		mmap_write_unlock(mm);
> +		mmap_read_unlock(mm);
>  		return;
>  	}
>  
> @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
> +	mmap_read_unlock(mm);
> +
> +	/*
> +	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> +	 * because the memory has been already freed. Do not bother checking
> +	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> +	 */
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
> +
> +	mmap_write_lock(mm);

Is there a race here?  We had a VMA but after the read lock was dropped,
could the oom killer cause the VMA to be invalidated?  I don't think so
but the comment above about dup_mmap() receiving an OOM makes me
question it.  The code before kept the write lock from when the VMA was
found until the end of the mm edits - and it had the check for !vma
within the block itself.  We are also hiding it from the oom killer
outside the read lock so it is possible for oom to find it in that
window, right?

Could we just unconditionally set the skip bit before taking a write
lock for the duration of the exit?  I'm probably missing your reason for
doing it this way.
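
A rough sketch of the alternative being asked about here (this is not code
from the patch, just an illustration of the question; the setup calls such
as lru_add_drain()/flush_cache_mm()/tlb_gather_mmu_fullmm() are elided):

	/* Set the bit up front, then hold the exclusive lock for the whole exit. */
	set_bit(MMF_OOM_SKIP, &mm->flags);

	mmap_write_lock(mm);
	arch_exit_mmap(mm);
	unmap_vmas(&tlb, vma, 0, -1);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);
	mmap_write_unlock(mm);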

>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 49d7df39b02d..36355b162727 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -509,7 +509,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -bool __oom_reap_task_mm(struct mm_struct *mm)
> +static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -- 
> 2.36.0.550.gb090851708-goog
> 
> 


* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 19:29 ` Andrew Morton
@ 2022-05-19 20:33   ` Liam Howlett
  2022-05-19 21:43     ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Liam Howlett @ 2022-05-19 20:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, mhocko, rientjes, willy, hannes, guro,
	minchan, kirill, aarcange, brauner, hch, oleg, david, jannh,
	shakeelb, peterx, jhubbard, shuah, linux-kernel, linux-mm,
	kernel-team

* Andrew Morton <akpm@linux-foundation.org> [220519 15:29]:
> On Mon, 16 May 2022 00:56:18 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> 
> > The primary reason to invoke the oom reaper from the exit_mmap path used
> > to be a prevention of an excessive oom killing if the oom victim exit
> > races with the oom reaper (see [1] for more details). The invocation has
> > moved around since then because of the interaction with the munlock
> > logic but the underlying reason has remained the same (see [2]).
> > 
> > Munlock code is no longer a problem since [3] and there shouldn't be
> > any blocking operation before the memory is unmapped by exit_mmap so
> > the oom reaper invocation can be dropped. The unmapping part can be done
> > with the non-exclusive mmap_sem and the exclusive one is only required
> > when page tables are freed.
> > 
> > Remove the oom_reaper from exit_mmap which will make the code easier to
> > read. This is really unlikely to make any observable difference although
> > some microbenchmarks could benefit from one less branch that needs to be
> > evaluated even though it almost never is true.
> > 
> 
> Liam, this mucks "mm: start tracking VMAs with maple tree" somewhat.
> 
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> >  	return 0;
> >  }
> >  
> > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > -
> >  long oom_badness(struct task_struct *p,
> >  		unsigned long totalpages);
> >  
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 313b57d55a63..ded42150e706 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> >  	/* mm's last user has gone, and its about to be pulled down */
> >  	mmu_notifier_release(mm);
> >  
> > -	if (unlikely(mm_is_oom_victim(mm))) {
> > -		/*
> > -		 * Manually reap the mm to free as much memory as possible.
> > -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -		 * this mm from further consideration.  Taking mm->mmap_lock for
> > -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -		 * reaper will not run on this mm again after mmap_lock is
> > -		 * dropped.
> > -		 *
> > -		 * Nothing can be holding mm->mmap_lock here and the above call
> > -		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > -		 * __oom_reap_task_mm() will not block.
> > -		 */
> > -		(void)__oom_reap_task_mm(mm);
> > -		set_bit(MMF_OOM_SKIP, &mm->flags);
> > -	}
> > -
> > -	mmap_write_lock(mm);
> > +	mmap_read_lock(mm);
> >  	arch_exit_mmap(mm);
> >  
> >  	vma = mm->mmap;
> >  	if (!vma) {
> >  		/* Can happen if dup_mmap() received an OOM */
> > -		mmap_write_unlock(mm);
> > +		mmap_read_unlock(mm);
> >  		return;
> >  	}
> >  
> > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> >  	/* update_hiwater_rss(mm) here? but nobody should be looking */
> >  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
> >  	unmap_vmas(&tlb, vma, 0, -1);
> > +	mmap_read_unlock(mm);
> > +
> > +	/*
> > +	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > +	 * because the memory has been already freed. Do not bother checking
> > +	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > +	 */
> > +	set_bit(MMF_OOM_SKIP, &mm->flags);
> > +
> > +	mmap_write_lock(mm);
> >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >  	tlb_finish_mmu(&tlb);
> >  
> 
> I ended up with the below rework of "mm: start tracking VMAs with maple
> tree".  Please triple check?

One small fix in the first one.  Suren found a race with the oom reaper
or process_mrelease that needed the linked list to be cleared here.
Please correct me if I am mistaken, Suren?

> 
> void exit_mmap(struct mm_struct *mm)
> {
> 	struct mmu_gather tlb;
> 	struct vm_area_struct *vma;
> 	unsigned long nr_accounted = 0;
> 
> 	/* mm's last user has gone, and its about to be pulled down */
> 	mmu_notifier_release(mm);
> 
> 	mmap_write_lock(mm);
> 	arch_exit_mmap(mm);
> 	vma = mm->mmap;
> 	if (!vma) {
> 		/* Can happen if dup_mmap() received an OOM */
> 		mmap_write_unlock(mm);
> 		return;
> 	}
> 
> 	lru_add_drain();
> 	flush_cache_mm(mm);
> 	tlb_gather_mmu_fullmm(&tlb, mm);
> 	/* update_hiwater_rss(mm) here? but nobody should be looking */
> 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
> 	unmap_vmas(&tlb, vma, 0, -1);
> 
> 	/*
> 	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> 	 * because the memory has been already freed. Do not bother checking
> 	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> 	 */
> 	set_bit(MMF_OOM_SKIP, &mm->flags);
> 
> 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> 	tlb_finish_mmu(&tlb);
> 
> 	/* Walk the list again, actually closing and freeing it. */
> 	while (vma) {
> 		if (vma->vm_flags & VM_ACCOUNT)
> 			nr_accounted += vma_pages(vma);
> 		vma = remove_vma(vma);
> 		cond_resched();
> 	}
> 
> 	trace_exit_mmap(mm);
> 	__mt_destroy(&mm->mm_mt);

+	mm->mmap = NULL;

> 	mmap_write_unlock(mm);
> 	vm_unacct_memory(nr_accounted);
> }
> 
> 
> And "mm: remove the vma linked list" needed further reworking.  I ended
> up with
> 
> void exit_mmap(struct mm_struct *mm)
> {
> 	struct mmu_gather tlb;
> 	struct vm_area_struct *vma;
> 	unsigned long nr_accounted = 0;
> 	MA_STATE(mas, &mm->mm_mt, 0, 0);
> 	int count = 0;
> 
> 	/* mm's last user has gone, and its about to be pulled down */
> 	mmu_notifier_release(mm);
> 
> 	mmap_write_lock(mm);
> 	arch_exit_mmap(mm);
> 	vma = mas_find(&mas, ULONG_MAX);
> 	if (!vma) {
> 		/* Can happen if dup_mmap() received an OOM */
> 		mmap_write_unlock(mm);
> 		return;
> 	}
> 
> 	lru_add_drain();
> 	flush_cache_mm(mm);
> 	tlb_gather_mmu_fullmm(&tlb, mm);
> 	/* update_hiwater_rss(mm) here? but nobody should be looking */
> 	/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> 	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> 
> 	/*
> 	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> 	 * because the memory has been already freed. Do not bother checking
> 	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> 	 */
> 	set_bit(MMF_OOM_SKIP, &mm->flags);
> 
> 	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> 	tlb_finish_mmu(&tlb);
> 
> 	/*
> 	 * Walk the list again, actually closing and freeing it, with preemption
> 	 * enabled, without holding any MM locks besides the unreachable
> 	 * mmap_write_lock.
> 	 */
> 	do {
> 		if (vma->vm_flags & VM_ACCOUNT)
> 			nr_accounted += vma_pages(vma);
> 		remove_vma(vma);
> 		count++;
> 		cond_resched();
> 	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> 
> 	BUG_ON(count != mm->map_count);
> 
> 	trace_exit_mmap(mm);
> 	__mt_destroy(&mm->mm_mt);
> 	mmap_write_unlock(mm);
> 	vm_unacct_memory(nr_accounted);
> }

It is worth noting that this drops the mmap_read_lock/unlock before the
write locking.  I'm not sure why Suren had it in his patches and I've
responded just now asking about it.  It may be an important aspect of
what he was planning.

Thanks,
Liam


> 
> 
> The mapletree patches remain hidden from mm.git until, I expect, next week.
> 
> Thanks.
> 
> 


* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 20:22 ` Liam Howlett
@ 2022-05-19 21:33   ` Suren Baghdasaryan
  2022-05-19 22:14     ` Suren Baghdasaryan
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-19 21:33 UTC (permalink / raw)
  To: Liam Howlett
  Cc: akpm, mhocko, rientjes, willy, hannes, guro, minchan, kirill,
	aarcange, brauner, hch, oleg, david, jannh, shakeelb, peterx,
	jhubbard, shuah, linux-kernel, linux-mm, kernel-team

On Thu, May 19, 2022 at 1:22 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [220516 03:56]:
> > The primary reason to invoke the oom reaper from the exit_mmap path used
> > to be a prevention of an excessive oom killing if the oom victim exit
> > races with the oom reaper (see [1] for more details). The invocation has
> > moved around since then because of the interaction with the munlock
> > logic but the underlying reason has remained the same (see [2]).
> >
> > Munlock code is no longer a problem since [3] and there shouldn't be
> > any blocking operation before the memory is unmapped by exit_mmap so
> > the oom reaper invocation can be dropped. The unmapping part can be done
> > with the non-exclusive mmap_sem and the exclusive one is only required
> > when page tables are freed.
> >
> > Remove the oom_reaper from exit_mmap which will make the code easier to
> > read. This is really unlikely to make any observable difference although
> > some microbenchmarks could benefit from one less branch that needs to be
> > evaluated even though it almost never is true.
> >
> > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  include/linux/oom.h |  2 --
> >  mm/mmap.c           | 31 ++++++++++++-------------------
> >  mm/oom_kill.c       |  2 +-
> >  3 files changed, 13 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index 2db9a1432511..6cdf0772dbae 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> >       return 0;
> >  }
> >
> > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > -
> >  long oom_badness(struct task_struct *p,
> >               unsigned long totalpages);
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 313b57d55a63..ded42150e706 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> >       /* mm's last user has gone, and its about to be pulled down */
> >       mmu_notifier_release(mm);
> >
> > -     if (unlikely(mm_is_oom_victim(mm))) {
> > -             /*
> > -              * Manually reap the mm to free as much memory as possible.
> > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -              * reaper will not run on this mm again after mmap_lock is
> > -              * dropped.
> > -              *
> > -              * Nothing can be holding mm->mmap_lock here and the above call
> > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > -              * __oom_reap_task_mm() will not block.
> > -              */
> > -             (void)__oom_reap_task_mm(mm);
> > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > -     }
> > -
> > -     mmap_write_lock(mm);
> > +     mmap_read_lock(mm);
> >       arch_exit_mmap(mm);
>
> arch_exit_mmap() was called under the write lock before, is it safe to
> call it under the read lock?

Ah, good catch. I missed at least one call chain which I believe would
require arch_exit_mmap() to be called under write lock:

arch_exit_mmap
    ldt_arch_exit_mmap
        free_ldt_pgtables
            free_pgd_range
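
For context, the x86 leaf of that chain looks roughly like this (a simplified
sketch from memory, not a verbatim copy -- the real free_ldt_pgtables() is
also guarded by the PTI config and feature checks):

static void free_ldt_pgtables(struct mm_struct *mm)
{
	struct mmu_gather tlb;

	tlb_gather_mmu_fullmm(&tlb, mm);
	/*
	 * free_pgd_range() frees page-table pages for the PTI LDT mapping;
	 * everywhere else in exit_mmap() page tables are only freed while
	 * the exclusive mmap_lock is held.
	 */
	free_pgd_range(&tlb, LDT_BASE_ADDR, LDT_END_ADDR,
		       LDT_BASE_ADDR, LDT_END_ADDR);
	tlb_finish_mmu(&tlb);
}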

I'll need to check whether arch_exit_mmap() has to be called before
unmap_vmas(). If not, we could move it further down when we hold the
write lock.
Andrew, please remove this patchset from your tree for now until I fix this.

>
> >
> >       vma = mm->mmap;
> >       if (!vma) {
> >               /* Can happen if dup_mmap() received an OOM */
> > -             mmap_write_unlock(mm);
> > +             mmap_read_unlock(mm);
> >               return;
> >       }
> >
> > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> >       unmap_vmas(&tlb, vma, 0, -1);
> > +     mmap_read_unlock(mm);
> > +
> > +     /*
> > +      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > +      * because the memory has been already freed. Do not bother checking
> > +      * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > +      */
> > +     set_bit(MMF_OOM_SKIP, &mm->flags);
> > +
> > +     mmap_write_lock(mm);
>
> Is there a race here?  We had a VMA but after the read lock was dropped,
> could the oom killer cause the VMA to be invalidated?  I don't think so
> but the comment above about dup_mmap() receiving an OOM makes me
> question it.  The code before kept the write lock from when the VMA was
> found until the end of the mm edits - and it had the check for !vma
> within the block itself.  We are also hiding it from the oom killer
> outside the read lock so it is possible for oom to find it in that
> window, right?

When I was trying to understand that comment and looked into the
dup_mmap() code, my conclusion was that this check was there to
protect us from the case when dup_mmap() gets interrupted and leaves
mm->mmap=NULL. So, in a sense, it was not really a race with the OOM
killer but an interrupted dup_mmap() case. Once we have checked it
above, we don't need to recheck it again under the write lock. When I
asked Michal about this he was in agreement, but it's possible we
overlooked some corner case. If so, please let me know and I can add
this check here.

>
> Could we just unconditionally set the skip bit before taking a write
> lock for the duration of the exit?  I'm probably missing your reason for
> doing it this way.

That's what I'm doing - unconditionally setting MMF_OOM_SKIP before
taking the write lock. Did I miss something?

>
> >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >       tlb_finish_mmu(&tlb);
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 49d7df39b02d..36355b162727 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -509,7 +509,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> >  static struct task_struct *oom_reaper_list;
> >  static DEFINE_SPINLOCK(oom_reaper_lock);
> >
> > -bool __oom_reap_task_mm(struct mm_struct *mm)
> > +static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >       struct vm_area_struct *vma;
> >       bool ret = true;
> > --
> > 2.36.0.550.gb090851708-goog
> >
> >
>
>


* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 20:33   ` Liam Howlett
@ 2022-05-19 21:43     ` Suren Baghdasaryan
  2022-05-19 22:34       ` Liam Howlett
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-19 21:43 UTC (permalink / raw)
  To: Liam Howlett
  Cc: Andrew Morton, mhocko, rientjes, willy, hannes, guro, minchan,
	kirill, aarcange, brauner, hch, oleg, david, jannh, shakeelb,
	peterx, jhubbard, shuah, linux-kernel, linux-mm, kernel-team

On Thu, May 19, 2022 at 1:33 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> [220519 15:29]:
> > On Mon, 16 May 2022 00:56:18 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > to be a prevention of an excessive oom killing if the oom victim exit
> > > races with the oom reaper (see [1] for more details). The invocation has
> > > moved around since then because of the interaction with the munlock
> > > logic but the underlying reason has remained the same (see [2]).
> > >
> > > Munlock code is no longer a problem since [3] and there shouldn't be
> > > any blocking operation before the memory is unmapped by exit_mmap so
> > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > when page tables are freed.
> > >
> > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > read. This is really unlikely to make any observable difference although
> > > some microbenchmarks could benefit from one less branch that needs to be
> > > evaluated even though it almost never is true.
> > >
> >
> > Liam, this mucks "mm: start tracking VMAs with maple tree" somewhat.
> >
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > >     return 0;
> > >  }
> > >
> > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > -
> > >  long oom_badness(struct task_struct *p,
> > >             unsigned long totalpages);
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 313b57d55a63..ded42150e706 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> > >     /* mm's last user has gone, and its about to be pulled down */
> > >     mmu_notifier_release(mm);
> > >
> > > -   if (unlikely(mm_is_oom_victim(mm))) {
> > > -           /*
> > > -            * Manually reap the mm to free as much memory as possible.
> > > -            * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > -            * this mm from further consideration.  Taking mm->mmap_lock for
> > > -            * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > -            * reaper will not run on this mm again after mmap_lock is
> > > -            * dropped.
> > > -            *
> > > -            * Nothing can be holding mm->mmap_lock here and the above call
> > > -            * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > -            * __oom_reap_task_mm() will not block.
> > > -            */
> > > -           (void)__oom_reap_task_mm(mm);
> > > -           set_bit(MMF_OOM_SKIP, &mm->flags);
> > > -   }
> > > -
> > > -   mmap_write_lock(mm);
> > > +   mmap_read_lock(mm);
> > >     arch_exit_mmap(mm);
> > >
> > >     vma = mm->mmap;
> > >     if (!vma) {
> > >             /* Can happen if dup_mmap() received an OOM */
> > > -           mmap_write_unlock(mm);
> > > +           mmap_read_unlock(mm);
> > >             return;
> > >     }
> > >
> > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > >     /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >     /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > >     unmap_vmas(&tlb, vma, 0, -1);
> > > +   mmap_read_unlock(mm);
> > > +
> > > +   /*
> > > +    * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > +    * because the memory has been already freed. Do not bother checking
> > > +    * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > +    */
> > > +   set_bit(MMF_OOM_SKIP, &mm->flags);
> > > +
> > > +   mmap_write_lock(mm);
> > >     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >     tlb_finish_mmu(&tlb);
> > >
> >
> > I ended up with the below rework of "mm: start tracking VMAs with maple
> > tree".  Please triple check?
>
> One small fix in the first one.  Suren found a race with the oom or
> process_mrelease that needed the linked list to be removed here.  Please
> correct me if I am mistaken, Suren?
>
> >
> > void exit_mmap(struct mm_struct *mm)
> > {
> >       struct mmu_gather tlb;
> >       struct vm_area_struct *vma;
> >       unsigned long nr_accounted = 0;
> >
> >       /* mm's last user has gone, and its about to be pulled down */
> >       mmu_notifier_release(mm);
> >
> >       mmap_write_lock(mm);
> >       arch_exit_mmap(mm);
> >       vma = mm->mmap;
> >       if (!vma) {
> >               /* Can happen if dup_mmap() received an OOM */
> >               mmap_write_unlock(mm);
> >               return;
> >       }
> >
> >       lru_add_drain();
> >       flush_cache_mm(mm);
> >       tlb_gather_mmu_fullmm(&tlb, mm);
> >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> >       unmap_vmas(&tlb, vma, 0, -1);
> >
> >       /*
> >        * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> >        * because the memory has been already freed. Do not bother checking
> >        * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> >        */
> >       set_bit(MMF_OOM_SKIP, &mm->flags);
> >
> >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >       tlb_finish_mmu(&tlb);
> >
> >       /* Walk the list again, actually closing and freeing it. */
> >       while (vma) {
> >               if (vma->vm_flags & VM_ACCOUNT)
> >                       nr_accounted += vma_pages(vma);
> >               vma = remove_vma(vma);
> >               cond_resched();
> >       }
> >
> >       trace_exit_mmap(mm);
> >       __mt_destroy(&mm->mm_mt);
>
> +       mm->mmap = NULL;

That's correct. We need to reset mm->mmap so that the loop in
__oom_reap_task_mm() stops immediately. However with maple trees I
believe that loop is different and with an empty tree there would be
no dereferencing. Liam?
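
For reference, the walk in __oom_reap_task_mm() is (roughly sketched here,
with the details of the unmap step elided) a plain traversal of the vma
linked list:

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma->vm_flags & (VM_HUGETLB | VM_PFNMAP))
			continue;
		/* set up an mmu notifier range and unmap the vma's pages */
	}

With mm->mmap set to NULL the loop body never runs, so a reaper that races
with the teardown simply finds nothing left to do.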

>
> >       mmap_write_unlock(mm);
> >       vm_unacct_memory(nr_accounted);
> > }
> >
> >
> > And "mm: remove the vma linked list" needed further reworking.  I ended
> > up with
> >
> > void exit_mmap(struct mm_struct *mm)
> > {
> >       struct mmu_gather tlb;
> >       struct vm_area_struct *vma;
> >       unsigned long nr_accounted = 0;
> >       MA_STATE(mas, &mm->mm_mt, 0, 0);
> >       int count = 0;
> >
> >       /* mm's last user has gone, and its about to be pulled down */
> >       mmu_notifier_release(mm);
> >
> >       mmap_write_lock(mm);
> >       arch_exit_mmap(mm);
> >       vma = mas_find(&mas, ULONG_MAX);
> >       if (!vma) {
> >               /* Can happen if dup_mmap() received an OOM */
> >               mmap_write_unlock(mm);
> >               return;
> >       }
> >
> >       lru_add_drain();
> >       flush_cache_mm(mm);
> >       tlb_gather_mmu_fullmm(&tlb, mm);
> >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> >       /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> >       unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> >
> >       /*
> >        * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> >        * because the memory has been already freed. Do not bother checking
> >        * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> >        */
> >       set_bit(MMF_OOM_SKIP, &mm->flags);
> >
> >       free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >       tlb_finish_mmu(&tlb);
> >
> >       /*
> >        * Walk the list again, actually closing and freeing it, with preemption
> >        * enabled, without holding any MM locks besides the unreachable
> >        * mmap_write_lock.
> >        */
> >       do {
> >               if (vma->vm_flags & VM_ACCOUNT)
> >                       nr_accounted += vma_pages(vma);
> >               remove_vma(vma);
> >               count++;
> >               cond_resched();
> >       } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> >
> >       BUG_ON(count != mm->map_count);
> >
> >       trace_exit_mmap(mm);
> >       __mt_destroy(&mm->mm_mt);
> >       mmap_write_unlock(mm);
> >       vm_unacct_memory(nr_accounted);
> > }
>
> It is worth noting that this drops the mmap_read_lock/unlock before the
> write locking.  I'm not sure why Suren had it in his patches and I've
> responded just now asking about it.  It may be an important aspect of
> what he was planning.

unmap_vmas() does not require the mmap_write_lock, and doing it under
read lock protection allows the OOM killer and process_mrelease to run
in parallel with exit_mmap. That was the reason I started with
mmap_read_lock and then switched to mmap_write_lock. If that creates
issues we should switch to mmap_write_lock for the whole duration of
this call.
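
A minimal sketch of the concurrency this enables (the helper name below is
made up for illustration; the calls are the ones the oom reaper and
process_mrelease already use -- both sides only take the lock for read, so
they can overlap with the read-locked unmap_vmas() above and are excluded
only from the page-table freeing):

/* Hypothetical reaper-side helper, for illustration only. */
static bool try_reap(struct mm_struct *mm)
{
	bool ret = true;

	if (!mmap_read_trylock(mm))
		return false;
	if (!test_bit(MMF_OOM_SKIP, &mm->flags))	/* memory not freed yet */
		ret = __oom_reap_task_mm(mm);	/* unmaps ranges, never frees page tables */
	mmap_read_unlock(mm);
	return ret;
}
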
Thanks,
Suren.

>
> Thanks,
> Liam
>
>
> >
> >
> > The mapletree patches remain hidden from mm.git until, I expect, next week.
> >
> > Thanks.
> >
> >
>
>


* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 21:33   ` Suren Baghdasaryan
@ 2022-05-19 22:14     ` Suren Baghdasaryan
  2022-05-19 22:56     ` Liam Howlett
  2022-05-20  7:21     ` Michal Hocko
  2 siblings, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-19 22:14 UTC (permalink / raw)
  To: Liam Howlett
  Cc: akpm, mhocko, rientjes, willy, hannes, guro, minchan, kirill,
	aarcange, brauner, hch, oleg, david, jannh, shakeelb, peterx,
	jhubbard, shuah, linux-kernel, linux-mm, kernel-team

On Thu, May 19, 2022 at 2:33 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, May 19, 2022 at 1:22 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [220516 03:56]:
> > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > to be a prevention of an excessive oom killing if the oom victim exit
> > > races with the oom reaper (see [1] for more details). The invocation has
> > > moved around since then because of the interaction with the munlock
> > > logic but the underlying reason has remained the same (see [2]).
> > >
> > > Munlock code is no longer a problem since [3] and there shouldn't be
> > > any blocking operation before the memory is unmapped by exit_mmap so
> > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > when page tables are freed.
> > >
> > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > read. This is really unlikely to make any observable difference although
> > > some microbenchmarks could benefit from one less branch that needs to be
> > > evaluated even though it almost never is true.
> > >
> > > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  include/linux/oom.h |  2 --
> > >  mm/mmap.c           | 31 ++++++++++++-------------------
> > >  mm/oom_kill.c       |  2 +-
> > >  3 files changed, 13 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > index 2db9a1432511..6cdf0772dbae 100644
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > >       return 0;
> > >  }
> > >
> > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > -
> > >  long oom_badness(struct task_struct *p,
> > >               unsigned long totalpages);
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 313b57d55a63..ded42150e706 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > > -     if (unlikely(mm_is_oom_victim(mm))) {
> > > -             /*
> > > -              * Manually reap the mm to free as much memory as possible.
> > > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > -              * reaper will not run on this mm again after mmap_lock is
> > > -              * dropped.
> > > -              *
> > > -              * Nothing can be holding mm->mmap_lock here and the above call
> > > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > -              * __oom_reap_task_mm() will not block.
> > > -              */
> > > -             (void)__oom_reap_task_mm(mm);
> > > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > > -     }
> > > -
> > > -     mmap_write_lock(mm);
> > > +     mmap_read_lock(mm);
> > >       arch_exit_mmap(mm);
> >
> > arch_exit_mmap() was called under the write lock before, is it safe to
> > call it under the read lock?
>
> Ah, good catch. I missed at least one call chain which I believe would
> require arch_exit_mmap() to be called under write lock:
>
> arch_exit_mmap
>     ldt_arch_exit_mmap
>         free_ldt_pgtables
>             free_pgd_range
>
> I'll need to check whether arch_exit_mmap() has to be called before
> unmap_vmas(). If not, we could move it further down when we hold the
> write lock.
> Andrew, please remove this patchset from your tree for now until I fix this.

I think it should be fine to move arch_exit_mmap() to be called right
after mmap_write_lock. This changes the order of calls from:

arch_exit_mmap()
unmap_vmas()

to

unmap_vmas()
arch_exit_mmap()

However, I don't see any implementation of arch_exit_mmap() which uses
mm->mmap, so it seems safe. I'll wait a day or so for possible
objections and will post a new version.
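
Put together, the reordering described above would look roughly like this
(a sketch assembled from the pieces already in this patch, not the new
version itself; setup calls elided):

	mmap_read_lock(mm);
	vma = mm->mmap;
	if (!vma) {
		/* Can happen if dup_mmap() received an OOM */
		mmap_read_unlock(mm);
		return;
	}
	unmap_vmas(&tlb, vma, 0, -1);
	mmap_read_unlock(mm);

	set_bit(MMF_OOM_SKIP, &mm->flags);

	mmap_write_lock(mm);
	arch_exit_mmap(mm);	/* moved: now runs under the exclusive lock */
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);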

>
> >
> > >
> > >       vma = mm->mmap;
> > >       if (!vma) {
> > >               /* Can happen if dup_mmap() received an OOM */
> > > -             mmap_write_unlock(mm);
> > > +             mmap_read_unlock(mm);
> > >               return;
> > >       }
> > >
> > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > >       unmap_vmas(&tlb, vma, 0, -1);
> > > +     mmap_read_unlock(mm);
> > > +
> > > +     /*
> > > +      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > +      * because the memory has been already freed. Do not bother checking
> > > +      * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > +      */
> > > +     set_bit(MMF_OOM_SKIP, &mm->flags);
> > > +
> > > +     mmap_write_lock(mm);
> >
> > Is there a race here?  We had a VMA but after the read lock was dropped,
> > could the oom killer cause the VMA to be invalidated?  I don't think so
> > but the comment above about dup_mmap() receiving an OOM makes me
> > question it.  The code before kept the write lock from when the VMA was
> > found until the end of the mm edits - and it had the check for !vma
> > within the block itself.  We are also hiding it from the oom killer
> > outside the read lock so it is possible for oom to find it in that
> > window, right?
>
> When I was trying to understand that comment and looked into
> dup_mmap() code, my conclusion was that this check was there to
> protect us from the case when dup_mmap() gets interrupted and leaves
> mm->mmap=NULL. So, in a sense it was not really a race with OOM killer
> but an interrupted dup_mmap() case. So, once we checked it above we
> don't need to recheck again under write lock. When I asked Michal
> about this he was in agreement but it's possible we overlooked some
> corner case. If so, please let me know and I can add this check here.
>
> >
> > Could we just unconditionally set the skip bit before taking a write
> > lock for the duration of the exit?  I'm probably missing your reason for
> > doing it this way.
>
> That's what I'm doing - unconditionally setting MMF_OOM_SKIP before
> taking the write lock. Did I miss something?
>
> >
> > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >       tlb_finish_mmu(&tlb);
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 49d7df39b02d..36355b162727 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -509,7 +509,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> > >  static struct task_struct *oom_reaper_list;
> > >  static DEFINE_SPINLOCK(oom_reaper_lock);
> > >
> > > -bool __oom_reap_task_mm(struct mm_struct *mm)
> > > +static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  {
> > >       struct vm_area_struct *vma;
> > >       bool ret = true;
> > > --
> > > 2.36.0.550.gb090851708-goog
> > >
> > >
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 21:43     ` Suren Baghdasaryan
@ 2022-05-19 22:34       ` Liam Howlett
  0 siblings, 0 replies; 16+ messages in thread
From: Liam Howlett @ 2022-05-19 22:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, mhocko, rientjes, willy, hannes, guro, minchan,
	kirill, aarcange, brauner, hch, oleg, david, jannh, shakeelb,
	peterx, jhubbard, shuah, linux-kernel, linux-mm, kernel-team

* Suren Baghdasaryan <surenb@google.com> [220519 17:44]:
> On Thu, May 19, 2022 at 1:33 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Andrew Morton <akpm@linux-foundation.org> [220519 15:29]:
> > > On Mon, 16 May 2022 00:56:18 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > > to be a prevention of an excessive oom killing if the oom victim exit
> > > > races with the oom reaper (see [1] for more details). The invocation has
> > > > moved around since then because of the interaction with the munlock
> > > > logic but the underlying reason has remained the same (see [2]).
> > > >
> > > > Munlock code is no longer a problem since [3] and there shouldn't be
> > > > any blocking operation before the memory is unmapped by exit_mmap so
> > > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > > when page tables are freed.
> > > >
> > > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > > read. This is really unlikely to make any observable difference although
> > > > some microbenchmarks could benefit from one less branch that needs to be
> > > > evaluated even though it almost never is true.
> > > >
> > >
> > > Liam, this mucks "mm: start tracking VMAs with maple tree" somewhat.
> > >
> > > > --- a/include/linux/oom.h
> > > > +++ b/include/linux/oom.h
> > > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > > >     return 0;
> > > >  }
> > > >
> > > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > > -
> > > >  long oom_badness(struct task_struct *p,
> > > >             unsigned long totalpages);
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 313b57d55a63..ded42150e706 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> > > >     /* mm's last user has gone, and its about to be pulled down */
> > > >     mmu_notifier_release(mm);
> > > >
> > > > -   if (unlikely(mm_is_oom_victim(mm))) {
> > > > -           /*
> > > > -            * Manually reap the mm to free as much memory as possible.
> > > > -            * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > > -            * this mm from further consideration.  Taking mm->mmap_lock for
> > > > -            * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > > -            * reaper will not run on this mm again after mmap_lock is
> > > > -            * dropped.
> > > > -            *
> > > > -            * Nothing can be holding mm->mmap_lock here and the above call
> > > > -            * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > > -            * __oom_reap_task_mm() will not block.
> > > > -            */
> > > > -           (void)__oom_reap_task_mm(mm);
> > > > -           set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > -   }
> > > > -
> > > > -   mmap_write_lock(mm);
> > > > +   mmap_read_lock(mm);
> > > >     arch_exit_mmap(mm);
> > > >
> > > >     vma = mm->mmap;
> > > >     if (!vma) {
> > > >             /* Can happen if dup_mmap() received an OOM */
> > > > -           mmap_write_unlock(mm);
> > > > +           mmap_read_unlock(mm);
> > > >             return;
> > > >     }
> > > >
> > > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > > >     /* update_hiwater_rss(mm) here? but nobody should be looking */
> > > >     /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > >     unmap_vmas(&tlb, vma, 0, -1);
> > > > +   mmap_read_unlock(mm);
> > > > +
> > > > +   /*
> > > > +    * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > > +    * because the memory has been already freed. Do not bother checking
> > > > +    * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > > +    */
> > > > +   set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > +
> > > > +   mmap_write_lock(mm);
> > > >     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >     tlb_finish_mmu(&tlb);
> > > >
> > >
> > > I ended up with the below rework of "mm: start tracking VMAs with maple
> > > tree".  Please triple check?
> >
> > One small fix in the first one.  Suren found a race with the oom or
> > process_mrelease that needed the linked list to be removed here.  Please
> > correct me if I am mistaken, Suren?
> >
> > >
> > > void exit_mmap(struct mm_struct *mm)
> > > {
> > >       struct mmu_gather tlb;
> > >       struct vm_area_struct *vma;
> > >       unsigned long nr_accounted = 0;
> > >
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > >       mmap_write_lock(mm);
> > >       arch_exit_mmap(mm);
> > >       vma = mm->mmap;
> > >       if (!vma) {
> > >               /* Can happen if dup_mmap() received an OOM */
> > >               mmap_write_unlock(mm);
> > >               return;
> > >       }
> > >
> > >       lru_add_drain();
> > >       flush_cache_mm(mm);
> > >       tlb_gather_mmu_fullmm(&tlb, mm);
> > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > >       unmap_vmas(&tlb, vma, 0, -1);
> > >
> > >       /*
> > >        * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > >        * because the memory has been already freed. Do not bother checking
> > >        * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > >        */
> > >       set_bit(MMF_OOM_SKIP, &mm->flags);
> > >
> > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >       tlb_finish_mmu(&tlb);
> > >
> > >       /* Walk the list again, actually closing and freeing it. */
> > >       while (vma) {
> > >               if (vma->vm_flags & VM_ACCOUNT)
> > >                       nr_accounted += vma_pages(vma);
> > >               vma = remove_vma(vma);
> > >               cond_resched();
> > >       }
> > >
> > >       trace_exit_mmap(mm);
> > >       __mt_destroy(&mm->mm_mt);
> >
> > +       mm->mmap = NULL;
> 
> That's correct. We need to reset mm->mmap so that the loop in
> __oom_reap_task_mm() stops immediately. However with maple trees I
> believe that loop is different and with an empty tree there would be
> no dereferencing. Liam?

That's correct after the "mm: remove the vma linked list".  At this
point we still need it.  The maple tree is tracking the VMAs but it's
not being used - just checked to be the same.
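
For context, the walk in question currently looks roughly like this
(abridged from mm/oom_kill.c; the mmu notifier plumbing and most per-VMA
checks are left out):

static bool __oom_reap_task_mm(struct mm_struct *mm)
{
        struct vm_area_struct *vma;
        bool ret = true;

        /*
         * As long as the linked list exists the reaper iterates mm->mmap,
         * so an exiting mm must set MMF_OOM_SKIP and/or clear mm->mmap
         * under the exclusive lock before its page tables are freed.
         */
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (vma->vm_flags & (VM_HUGETLB | VM_PFNMAP))
                        continue;
                /* ... unmap the eligible private/anonymous ranges ... */
        }
        return ret;
}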

> 
> >
> > >       mmap_write_unlock(mm);
> > >       vm_unacct_memory(nr_accounted);
> > > }
> > >
> > >
> > > And "mm: remove the vma linked list" needed further reworking.  I ended
> > > up with
> > >
> > > void exit_mmap(struct mm_struct *mm)
> > > {
> > >       struct mmu_gather tlb;
> > >       struct vm_area_struct *vma;
> > >       unsigned long nr_accounted = 0;
> > >       MA_STATE(mas, &mm->mm_mt, 0, 0);
> > >       int count = 0;
> > >
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > >       mmap_write_lock(mm);
> > >       arch_exit_mmap(mm);
> > >       vma = mas_find(&mas, ULONG_MAX);
> > >       if (!vma) {
> > >               /* Can happen if dup_mmap() received an OOM */
> > >               mmap_write_unlock(mm);
> > >               return;
> > >       }
> > >
> > >       lru_add_drain();
> > >       flush_cache_mm(mm);
> > >       tlb_gather_mmu_fullmm(&tlb, mm);
> > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >       /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> > >       unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> > >
> > >       /*
> > >        * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > >        * because the memory has been already freed. Do not bother checking
> > >        * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > >        */
> > >       set_bit(MMF_OOM_SKIP, &mm->flags);
> > >
> > >       free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >       tlb_finish_mmu(&tlb);
> > >
> > >       /*
> > >        * Walk the list again, actually closing and freeing it, with preemption
> > >        * enabled, without holding any MM locks besides the unreachable
> > >        * mmap_write_lock.
> > >        */
> > >       do {
> > >               if (vma->vm_flags & VM_ACCOUNT)
> > >                       nr_accounted += vma_pages(vma);
> > >               remove_vma(vma);
> > >               count++;
> > >               cond_resched();
> > >       } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > >
> > >       BUG_ON(count != mm->map_count);
> > >
> > >       trace_exit_mmap(mm);
> > >       __mt_destroy(&mm->mm_mt);
> > >       mmap_write_unlock(mm);
> > >       vm_unacct_memory(nr_accounted);
> > > }
> >
> > It is worth noting that this drops the mmap_read_lock/unlock before the
> > write locking.  I'm not sure why Suren had it in his patches and I've
> > responded just now asking about it.  It may be an important aspect of
> > what he was planning.
> 
> unmap_vmas() does not require a mmap_write_lock and doing it under
> read lock protection allows OOM-killer and process_mrelease to run in
> parallel with exit_mmap. That is why I start with mmap_read_lock and
> then switch to mmap_write_lock. If that creates issues we should
> switch to mmap_write_lock for the whole duration of this call.
> Thanks,
> Suren.
> 
> >
> > Thanks,
> > Liam
> >
> >
> > >
> > >
> > > The mapletree patches remain hidden from mm.git until, I expect, next week.
> > >
> > > Thanks.
> > >
> > >
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 21:33   ` Suren Baghdasaryan
  2022-05-19 22:14     ` Suren Baghdasaryan
@ 2022-05-19 22:56     ` Liam Howlett
  2022-05-19 23:08       ` Suren Baghdasaryan
  2022-05-20  7:21     ` Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: Liam Howlett @ 2022-05-19 22:56 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, rientjes, willy, hannes, guro, minchan, kirill,
	aarcange, brauner, hch, oleg, david, jannh, shakeelb, peterx,
	jhubbard, shuah, linux-kernel, linux-mm, kernel-team

* Suren Baghdasaryan <surenb@google.com> [220519 17:33]:
> On Thu, May 19, 2022 at 1:22 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [220516 03:56]:
> > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > to be a prevention of an excessive oom killing if the oom victim exit
> > > races with the oom reaper (see [1] for more details). The invocation has
> > > moved around since then because of the interaction with the munlock
> > > logic but the underlying reason has remained the same (see [2]).
> > >
> > > Munlock code is no longer a problem since [3] and there shouldn't be
> > > any blocking operation before the memory is unmapped by exit_mmap so
> > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > when page tables are freed.
> > >
> > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > read. This is really unlikely to make any observable difference although
> > > some microbenchmarks could benefit from one less branch that needs to be
> > > evaluated even though it almost never is true.
> > >
> > > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  include/linux/oom.h |  2 --
> > >  mm/mmap.c           | 31 ++++++++++++-------------------
> > >  mm/oom_kill.c       |  2 +-
> > >  3 files changed, 13 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > index 2db9a1432511..6cdf0772dbae 100644
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > >       return 0;
> > >  }
> > >
> > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > -
> > >  long oom_badness(struct task_struct *p,
> > >               unsigned long totalpages);
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 313b57d55a63..ded42150e706 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > > -     if (unlikely(mm_is_oom_victim(mm))) {
> > > -             /*
> > > -              * Manually reap the mm to free as much memory as possible.
> > > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > -              * reaper will not run on this mm again after mmap_lock is
> > > -              * dropped.
> > > -              *
> > > -              * Nothing can be holding mm->mmap_lock here and the above call
> > > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > -              * __oom_reap_task_mm() will not block.
> > > -              */
> > > -             (void)__oom_reap_task_mm(mm);
> > > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > > -     }
> > > -
> > > -     mmap_write_lock(mm);
> > > +     mmap_read_lock(mm);
> > >       arch_exit_mmap(mm);
> >
> > arch_exit_mmap() was called under the write lock before, is it safe to
> > call it under the read lock?
> 
> Ah, good catch. I missed at least one call chain which I believe would
> require arch_exit_mmap() to be called under write lock:
> 
> arch_exit_mmap
>     ldt_arch_exit_mmap
>         free_ldt_pgtables
>             free_pgd_range
> 
> I'll need to check whether arch_exit_mmap() has to be called before
> unmap_vmas(). If not, we could move it further down when we hold the
> write lock.
> Andrew, please remove this patchset from your tree for now until I fix this.
> 
> >
> > >
> > >       vma = mm->mmap;
> > >       if (!vma) {
> > >               /* Can happen if dup_mmap() received an OOM */
> > > -             mmap_write_unlock(mm);
> > > +             mmap_read_unlock(mm);
> > >               return;
> > >       }
> > >
> > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > >       unmap_vmas(&tlb, vma, 0, -1);
> > > +     mmap_read_unlock(mm);
> > > +
> > > +     /*
> > > +      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > +      * because the memory has been already freed. Do not bother checking
> > > +      * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > +      */
> > > +     set_bit(MMF_OOM_SKIP, &mm->flags);
> > > +
> > > +     mmap_write_lock(mm);
> >
> > Is there a race here?  We had a VMA but after the read lock was dropped,
> > could the oom killer cause the VMA to be invalidated?  I don't think so
> > but the comment above about dup_mmap() receiving an OOM makes me
> > question it.  The code before kept the write lock from when the VMA was
> > found until the end of the mm edits - and it had the check for !vma
> > within the block itself.  We are also hiding it from the oom killer
> > outside the read lock so it is possible for oom to find it in that
> > window, right?
> 
> When I was trying to understand that comment and looked into
> dup_mmap() code, my conclusion was that this check was there to
> protect us from the case when dup_mmap() gets interrupted and leaves
> mm->mmap=NULL. So, in a sense it was not really a race with OOM killer
> but an interrupted dup_mmap() case. So, once we checked it above we
> don't need to recheck again under write lock. When I asked Michal
> about this he was in agreement but it's possible we overlooked some
> corner case. If so, please let me know and I can add this check here.

I didn't see how it was a problem either; neither of the other entry
points modifies the vma linked list/tree.
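
(For anyone following along, the dup_mmap() case the comment refers to is
roughly the following, simplified from the copy loop in kernel/fork.c; if
the loop bails out early, e.g. on a fatal signal from the oom killer, the
child mm can reach exit_mmap() with mm->mmap still NULL.)

        for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
                /* Don't duplicate many vmas if we've been oom-killed */
                if (fatal_signal_pending(current)) {
                        retval = -EINTR;
                        break;  /* simplified; the child may have no VMAs linked yet */
                }
                /* ... vm_area_dup(), copy_page_range(), link into mm->mmap ... */
        }
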

> 
> >
> > Could we just unconditionally set the skip bit before taking a write
> > lock for the duration of the exit?  I'm probably missing your reason for
> > doing it this way.
> 
> That's what I'm doing - unconditionally setting MMF_OOM_SKIP before
> taking the write lock. Did I miss something?

Sorry, I meant to type "before the read lock".  I think you answered
this in the other thread though.  I think you want the oom killer and
process_mrelease to be able to run in parallel to the exiting of the
task?  If so, is it worth all tasks taking the read lock and then
dropping it to allow this rare case?

> 
> >
> > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >       tlb_finish_mmu(&tlb);
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 49d7df39b02d..36355b162727 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -509,7 +509,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> > >  static struct task_struct *oom_reaper_list;
> > >  static DEFINE_SPINLOCK(oom_reaper_lock);
> > >
> > > -bool __oom_reap_task_mm(struct mm_struct *mm)
> > > +static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  {
> > >       struct vm_area_struct *vma;
> > >       bool ret = true;
> > > --
> > > 2.36.0.550.gb090851708-goog
> > >
> > >
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 22:56     ` Liam Howlett
@ 2022-05-19 23:08       ` Suren Baghdasaryan
  0 siblings, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-19 23:08 UTC (permalink / raw)
  To: Liam Howlett
  Cc: akpm, mhocko, rientjes, willy, hannes, guro, minchan, kirill,
	aarcange, brauner, hch, oleg, david, jannh, shakeelb, peterx,
	jhubbard, shuah, linux-kernel, linux-mm, kernel-team

On Thu, May 19, 2022 at 3:56 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [220519 17:33]:
> > On Thu, May 19, 2022 at 1:22 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [220516 03:56]:
> > > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > > to be a prevention of an excessive oom killing if the oom victim exit
> > > > races with the oom reaper (see [1] for more details). The invocation has
> > > > moved around since then because of the interaction with the munlock
> > > > logic but the underlying reason has remained the same (see [2]).
> > > >
> > > > Munlock code is no longer a problem since [3] and there shouldn't be
> > > > any blocking operation before the memory is unmapped by exit_mmap so
> > > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > > when page tables are freed.
> > > >
> > > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > > read. This is really unlikely to make any observable difference although
> > > > some microbenchmarks could benefit from one less branch that needs to be
> > > > evaluated even though it almost never is true.
> > > >
> > > > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > > > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  include/linux/oom.h |  2 --
> > > >  mm/mmap.c           | 31 ++++++++++++-------------------
> > > >  mm/oom_kill.c       |  2 +-
> > > >  3 files changed, 13 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > > index 2db9a1432511..6cdf0772dbae 100644
> > > > --- a/include/linux/oom.h
> > > > +++ b/include/linux/oom.h
> > > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > > >       return 0;
> > > >  }
> > > >
> > > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > > -
> > > >  long oom_badness(struct task_struct *p,
> > > >               unsigned long totalpages);
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 313b57d55a63..ded42150e706 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -3105,30 +3105,13 @@ void exit_mmap(struct mm_struct *mm)
> > > >       /* mm's last user has gone, and its about to be pulled down */
> > > >       mmu_notifier_release(mm);
> > > >
> > > > -     if (unlikely(mm_is_oom_victim(mm))) {
> > > > -             /*
> > > > -              * Manually reap the mm to free as much memory as possible.
> > > > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > > > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > > -              * reaper will not run on this mm again after mmap_lock is
> > > > -              * dropped.
> > > > -              *
> > > > -              * Nothing can be holding mm->mmap_lock here and the above call
> > > > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > > -              * __oom_reap_task_mm() will not block.
> > > > -              */
> > > > -             (void)__oom_reap_task_mm(mm);
> > > > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > -     }
> > > > -
> > > > -     mmap_write_lock(mm);
> > > > +     mmap_read_lock(mm);
> > > >       arch_exit_mmap(mm);
> > >
> > > arch_exit_mmap() was called under the write lock before, is it safe to
> > > call it under the read lock?
> >
> > Ah, good catch. I missed at least one call chain which I believe would
> > require arch_exit_mmap() to be called under write lock:
> >
> > arch_exit_mmap
> >     ldt_arch_exit_mmap
> >         free_ldt_pgtables
> >             free_pgd_range
> >
> > I'll need to check whether arch_exit_mmap() has to be called before
> > unmap_vmas(). If not, we could move it further down when we hold the
> > write lock.
> > Andrew, please remove this patchset from your tree for now until I fix this.
> >
> > >
> > > >
> > > >       vma = mm->mmap;
> > > >       if (!vma) {
> > > >               /* Can happen if dup_mmap() received an OOM */
> > > > -             mmap_write_unlock(mm);
> > > > +             mmap_read_unlock(mm);
> > > >               return;
> > > >       }
> > > >
> > > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > >       unmap_vmas(&tlb, vma, 0, -1);
> > > > +     mmap_read_unlock(mm);
> > > > +
> > > > +     /*
> > > > +      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > > +      * because the memory has been already freed. Do not bother checking
> > > > +      * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > > +      */
> > > > +     set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > +
> > > > +     mmap_write_lock(mm);
> > >
> > > Is there a race here?  We had a VMA but after the read lock was dropped,
> > > could the oom killer cause the VMA to be invalidated?  I don't think so
> > > but the comment above about dup_mmap() receiving an OOM makes me
> > > question it.  The code before kept the write lock from when the VMA was
> > > found until the end of the mm edits - and it had the check for !vma
> > > within the block itself.  We are also hiding it from the oom killer
> > > outside the read lock so it is possible for oom to find it in that
> > > window, right?
> >
> > When I was trying to understand that comment and looked into
> > dup_mmap() code, my conclusion was that this check was there to
> > protect us from the case when dup_mmap() gets interrupted and leaves
> > mm->mmap=NULL. So, in a sense it was not really a race with OOM killer
> > but an interrupted dup_mmap() case. So, once we checked it above we
> > don't need to recheck again under write lock. When I asked Michal
> > about this he was in agreement but it's possible we overlooked some
> > corner case. If so, please let me know and I can add this check here.
>
> I didn't see how it was a problem either; neither of the other entry
> points modifies the vma linked list/tree.
>
> >
> > >
> > > Could we just unconditionally set the skip bit before taking a write
> > > lock for the duration of the exit?  I'm probably missing your reason for
> > > doing it this way.
> >
> > That's what I'm doing - unconditionally setting MMF_OOM_SKIP before
> > taking the write lock. Did I miss something?
>
> Sorry, I meant to type "before the read lock".  I think you answered
> this in the other thread though.  I think you want the oom killer and
> process_mrelease to be able to run in parallel to the exiting of the
> task?  If so, is it worth all tasks taking the read lock and then
> dropping it to allow this rare case?

In the usual case the lock should be uncontended, so it should not be
an issue, I think.

>
> >
> > >
> > > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >       tlb_finish_mmu(&tlb);
> > > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 49d7df39b02d..36355b162727 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -509,7 +509,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> > > >  static struct task_struct *oom_reaper_list;
> > > >  static DEFINE_SPINLOCK(oom_reaper_lock);
> > > >
> > > > -bool __oom_reap_task_mm(struct mm_struct *mm)
> > > > +static bool __oom_reap_task_mm(struct mm_struct *mm)
> > > >  {
> > > >       struct vm_area_struct *vma;
> > > >       bool ret = true;
> > > > --
> > > > 2.36.0.550.gb090851708-goog
> > > >
> > > >
> > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-19 21:33   ` Suren Baghdasaryan
  2022-05-19 22:14     ` Suren Baghdasaryan
  2022-05-19 22:56     ` Liam Howlett
@ 2022-05-20  7:21     ` Michal Hocko
  2022-05-20 15:55       ` Suren Baghdasaryan
  2 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-05-20  7:21 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam Howlett, akpm, rientjes, willy, hannes, guro, minchan,
	kirill, aarcange, brauner, hch, oleg, david, jannh, shakeelb,
	peterx, jhubbard, shuah, linux-kernel, linux-mm, kernel-team

On Thu 19-05-22 14:33:03, Suren Baghdasaryan wrote:
> On Thu, May 19, 2022 at 1:22 PM Liam Howlett <liam.howlett@oracle.com> wrote:
[...]
> > arch_exit_mmap() was called under the write lock before, is it safe to
> > call it under the read lock?
> 
> Ah, good catch. I missed at least one call chain which I believe would
> require arch_exit_mmap() to be called under write lock:
> 
> arch_exit_mmap
>     ldt_arch_exit_mmap
>         free_ldt_pgtables
>             free_pgd_range

Why would this be a problem? This is LDT mapped into page tables but as
far as I know oom_reaper cannot really ever see that range because it is
not really reachable from any VMA.

> I'll need to check whether arch_exit_mmap() has to be called before
> unmap_vmas(). If not, we could move it further down when we hold the
> write lock.
> Andrew, please remove this patchset from your tree for now until I fix this.
> 
> >
> > >
> > >       vma = mm->mmap;
> > >       if (!vma) {
> > >               /* Can happen if dup_mmap() received an OOM */
> > > -             mmap_write_unlock(mm);
> > > +             mmap_read_unlock(mm);
> > >               return;
> > >       }
> > >
> > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > >       unmap_vmas(&tlb, vma, 0, -1);
> > > +     mmap_read_unlock(mm);
> > > +
> > > +     /*
> > > +      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > +      * because the memory has been already freed. Do not bother checking
> > > +      * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > +      */
> > > +     set_bit(MMF_OOM_SKIP, &mm->flags);
> > > +
> > > +     mmap_write_lock(mm);
> >
> > Is there a race here?  We had a VMA but after the read lock was dropped,
> > could the oom killer cause the VMA to be invalidated?

Nope, the oom killer itself doesn't do much beyond sending SIGKILL and
scheduling the victim for the oom_reaper. dup_mmap is holding exclusive
mmap_lock throughout the whole process.

> > I don't think so
> > but the comment above about dup_mmap() receiving an OOM makes me
> > question it.  The code before kept the write lock from when the VMA was
> > found until the end of the mm edits - and it had the check for !vma
> > within the block itself.  We are also hiding it from the oom killer
> > outside the read lock so it is possible for oom to find it in that
> > window, right?

The oom killer's victim selection doesn't really depend on the
mmap_lock. If there is a race and MMF_OOM_SKIP is not set yet then it
will consider the task and very likely bail out anyway because the
address space has already been unmapped so oom_badness() would consider
this task boring.

oom_reaper on the other hand would just try to unmap in parallel but
that is fine regardless of MMF_OOM_SKIP. Seeing the flag would allow to
bail out early rather than just trying to unmap something that is no
longer there. The only problem for the oom_reaper is to see page tables
of the address space disappearing from under its feet. That is excluded
by the exclusive lock and, as Suren mentions, the mm->mmap == NULL check
if the exit_mmap wins the race.
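
Schematically the reaper side is (abridged, tracing and error handling
dropped):

static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
        bool ret = true;

        /* e.g. exit_mmap() holds the lock for write: retry later */
        if (!mmap_read_trylock(mm))
                return false;

        /*
         * MMF_OOM_SKIP means exit_mmap() (or a previous reap) has already
         * released the memory, so there is nothing left to do here.
         */
        if (test_bit(MMF_OOM_SKIP, &mm->flags))
                goto out_unlock;

        ret = __oom_reap_task_mm(mm);

out_unlock:
        mmap_read_unlock(mm);
        return ret;
}
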
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-20  7:21     ` Michal Hocko
@ 2022-05-20 15:55       ` Suren Baghdasaryan
  2022-05-20 16:17         ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-20 15:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Liam Howlett, akpm, rientjes, willy, hannes, guro, minchan,
	kirill, aarcange, brauner, hch, oleg, david, jannh, shakeelb,
	peterx, jhubbard, shuah, linux-kernel, linux-mm, kernel-team

On Fri, May 20, 2022 at 12:21 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 19-05-22 14:33:03, Suren Baghdasaryan wrote:
> > On Thu, May 19, 2022 at 1:22 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> [...]
> > > arch_exit_mmap() was called under the write lock before, is it safe to
> > > call it under the read lock?
> >
> > Ah, good catch. I missed at least one call chain which I believe would
> > require arch_exit_mmap() to be called under write lock:
> >
> > arch_exit_mmap
> >     ldt_arch_exit_mmap
> >         free_ldt_pgtables
> >             free_pgd_range
>
> Why would this be a problem? This is LDT mapped into page tables but as
> far as I know oom_reaper cannot really ever see that range because it is
> not really reachable from any VMA.

Ah, thanks! I didn't realize these page tables are not reachable from
VMAs. The only other call that I'm not sure is ok without mmap write
lock is xen_hvm_exit_mmap:

arch_exit_mmap
    paravirt_arch_exit_mmap
        xen_hvm_exit_mmap

I'll look closer today but if someone can confirm it's safe then my
current patch should be fine as is.
Thanks,
Suren.

>
> > I'll need to check whether arch_exit_mmap() has to be called before
> > unmap_vmas(). If not, we could move it further down when we hold the
> > write lock.
> > Andrew, please remove this patchset from your tree for now until I fix this.
> >
> > >
> > > >
> > > >       vma = mm->mmap;
> > > >       if (!vma) {
> > > >               /* Can happen if dup_mmap() received an OOM */
> > > > -             mmap_write_unlock(mm);
> > > > +             mmap_read_unlock(mm);
> > > >               return;
> > > >       }
> > > >
> > > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > >       unmap_vmas(&tlb, vma, 0, -1);
> > > > +     mmap_read_unlock(mm);
> > > > +
> > > > +     /*
> > > > +      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > > +      * because the memory has been already freed. Do not bother checking
> > > > +      * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > > +      */
> > > > +     set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > +
> > > > +     mmap_write_lock(mm);
> > >
> > > Is there a race here?  We had a VMA but after the read lock was dropped,
> > > could the oom killer cause the VMA to be invalidated?
>
> Nope, the oom killer itself doesn't do much beyond sending SIGKILL and
> scheduling the victim for the oom_reaper. dup_mmap is holding exclusive
> mmap_lock throughout the whole process.
>
> > > I don't think so
> > > but the comment above about dup_mmap() receiving an OOM makes me
> > > question it.  The code before kept the write lock from when the VMA was
> > > found until the end of the mm edits - and it had the check for !vma
> > > within the block itself.  We are also hiding it from the oom killer
> > > outside the read lock so it is possible for oom to find it in that
> > > window, right?
>
> The oom killer's victim selection doesn't really depend on the
> mmap_lock. If there is a race and MMF_OOM_SKIP is not set yet then it
> will consider the task and very likely bail out anyway because the
> address space has already been unmapped so oom_badness() would consider
> this task boring.
>
> oom_reaper on the other hand would just try to unmap in parallel but
> that is fine regardless of MMF_OOM_SKIP. Seeing the flag would allow to
> bail out early rather than just trying to unmap something that is no
> longer there. The only problem for the oom_reaper is to see page tables
> of the address space disappearing from under its feet. That is excluded
> by the exclusive lock and, as Suren mentions, the mm->mmap == NULL check
> if the exit_mmap wins the race.
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mm: drop oom code from exit_mmap
  2022-05-20 15:55       ` Suren Baghdasaryan
@ 2022-05-20 16:17         ` Suren Baghdasaryan
  0 siblings, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2022-05-20 16:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Liam Howlett, akpm, rientjes, willy, hannes, guro, minchan,
	kirill, aarcange, brauner, hch, oleg, david, jannh, shakeelb,
	peterx, jhubbard, shuah, linux-kernel, linux-mm, kernel-team

On Fri, May 20, 2022 at 8:55 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, May 20, 2022 at 12:21 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 19-05-22 14:33:03, Suren Baghdasaryan wrote:
> > > On Thu, May 19, 2022 at 1:22 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > [...]
> > > > arch_exit_mmap() was called under the write lock before, is it safe to
> > > > call it under the read lock?
> > >
> > > Ah, good catch. I missed at least one call chain which I believe would
> > > require arch_exit_mmap() to be called under write lock:
> > >
> > > arch_exit_mmap
> > >     ldt_arch_exit_mmap
> > >         free_ldt_pgtables
> > >             free_pgd_range
> >
> > Why would this be a problem? This is LDT mapped into page tables but as
> > far as I know oom_reaper cannot really ever see that range because it is
> > not really reachable from any VMA.
>
> Ah, thanks! I didn't realize these page tables are not reachable from
> VMAs. The only other call that I'm not sure is ok without mmap write
> lock is xen_hvm_exit_mmap:
>
> arch_exit_mmap
>     paravirt_arch_exit_mmap
>         xen_hvm_exit_mmap
>
> I'll look closer today but if someone can confirm it's safe then my
> current patch should be fine as is.

My conclusion is that it's safe to call arch_exit_mmap() without the
exclusive mmap lock: the only possible competition is from the
OOM-killer/process_mrelease, which operate on mm->mmap, and none of the
arch_exit_mmap() implementations use mm->mmap.

Andrew, sorry for going back and forth. I think the patch is fine as
is and can be integrated. Thanks!


> Thanks,
> Suren.
>
> >
> > > I'll need to check whether arch_exit_mmap() has to be called before
> > > unmap_vmas(). If not, we could move it further down when we hold the
> > > write lock.
> > > Andrew, please remove this patchset from your tree for now until I fix this.
> > >
> > > >
> > > > >
> > > > >       vma = mm->mmap;
> > > > >       if (!vma) {
> > > > >               /* Can happen if dup_mmap() received an OOM */
> > > > > -             mmap_write_unlock(mm);
> > > > > +             mmap_read_unlock(mm);
> > > > >               return;
> > > > >       }
> > > > >
> > > > > @@ -3138,6 +3121,16 @@ void exit_mmap(struct mm_struct *mm)
> > > > >       /* update_hiwater_rss(mm) here? but nobody should be looking */
> > > > >       /* Use -1 here to ensure all VMAs in the mm are unmapped */
> > > > >       unmap_vmas(&tlb, vma, 0, -1);
> > > > > +     mmap_read_unlock(mm);
> > > > > +
> > > > > +     /*
> > > > > +      * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > > > +      * because the memory has been already freed. Do not bother checking
> > > > > +      * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > > > +      */
> > > > > +     set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > > +
> > > > > +     mmap_write_lock(mm);
> > > >
> > > > Is there a race here?  We had a VMA but after the read lock was dropped,
> > > > could the oom killer cause the VMA to be invalidated?
> >
> > Nope, the oom killer itself doesn't do much beyond sending SIGKILL and
> > scheduling the victim for the oom_reaper. dup_mmap is holding exclusive
> > mmap_lock throughout the whole process.
> >
> > > > I don't think so
> > > > but the comment above about dup_mmap() receiving an OOM makes me
> > > > question it.  The code before kept the write lock from when the VMA was
> > > > found until the end of the mm edits - and it had the check for !vma
> > > > within the block itself.  We are also hiding it from the oom killer
> > > > outside the read lock so it is possible for oom to find it in that
> > > > window, right?
> >
> > The oom killer's victim selection doesn't really depend on the
> > mmap_lock. If there is a race and MMF_OOM_SKIP is not set yet then it
> > will consider the task and very likely bail out anyway because the
> > address space has already been unmapped so oom_badness() would consider
> > this task boring.
> >
> > oom_reaper on the other hand would just try to unmap in parallel but
> > that is fine regardless of MMF_OOM_SKIP. Seeing the flag would allow to
> > bail out early rather than just trying to unmap something that is no
> > longer there. The only problem for the oom_reaper is to see page tables
> > of the address space disappearing from under its feet. That is excluded
> > by the exclusive lock and, as Suren mentions, the mm->mmap == NULL check
> > if the exit_mmap wins the race.
> > --
> > Michal Hocko
> > SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-05-20 16:17 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-16  7:56 [PATCH v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
2022-05-16  7:56 ` [PATCH v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
2022-05-17  3:13   ` Shuah Khan
2022-05-17  3:12 ` [PATCH v2 1/2] mm: drop oom code from exit_mmap Shuah Khan
2022-05-19 19:29 ` Andrew Morton
2022-05-19 20:33   ` Liam Howlett
2022-05-19 21:43     ` Suren Baghdasaryan
2022-05-19 22:34       ` Liam Howlett
2022-05-19 20:22 ` Liam Howlett
2022-05-19 21:33   ` Suren Baghdasaryan
2022-05-19 22:14     ` Suren Baghdasaryan
2022-05-19 22:56     ` Liam Howlett
2022-05-19 23:08       ` Suren Baghdasaryan
2022-05-20  7:21     ` Michal Hocko
2022-05-20 15:55       ` Suren Baghdasaryan
2022-05-20 16:17         ` Suren Baghdasaryan
