All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
@ 2022-05-31 22:30 Suren Baghdasaryan
  2022-05-31 22:31 ` [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
  2022-06-01 21:36 ` [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap Andrew Morton
  0 siblings, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2022-05-31 22:30 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, linux-kselftest, kernel-team,
	surenb

The primary reason to invoke the oom reaper from the exit_mmap path used
to be a prevention of an excessive oom killing if the oom victim exit
races with the oom reaper (see [1] for more details). The invocation has
moved around since then because of the interaction with the munlock
logic but the underlying reason has remained the same (see [2]).

Munlock code is no longer a problem since [3] and there shouldn't be
any blocking operation before the memory is unmapped by exit_mmap so
the oom reaper invocation can be dropped. The unmapping part can be done
with the non-exclusive mmap_sem and the exclusive one is only required
when page tables are freed.

Remove the oom_reaper from exit_mmap which will make the code easier to
read. This is really unlikely to make any observable difference although
some microbenchmarks could benefit from one less branch that needs to be
evaluated even though it almost never is true.

[1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
Notes:
- Rebased over git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
mm-unstable branch per Andrew's request but applies cleany to Linus' ToT
- Conflicts with maple-tree patchset. Resolving these was discussed in
https://lore.kernel.org/all/20220519223438.qx35hbpfnnfnpouw@revolver/

 include/linux/oom.h |  2 --
 mm/mmap.c           | 31 ++++++++++++-------------------
 mm/oom_kill.c       |  2 +-
 3 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 02d1e7bbd8cd..6cdde62b078b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
-
 long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 2b9305ed0dda..b7918e6bb0db 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3110,30 +3110,13 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_lock for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_lock is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_lock here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 */
-		(void)__oom_reap_task_mm(mm);
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-	}
-
-	mmap_write_lock(mm);
+	mmap_read_lock(mm);
 	arch_exit_mmap(mm);
 
 	vma = mm->mmap;
 	if (!vma) {
 		/* Can happen if dup_mmap() received an OOM */
-		mmap_write_unlock(mm);
+		mmap_read_unlock(mm);
 		return;
 	}
 
@@ -3143,6 +3126,16 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
+	mmap_read_unlock(mm);
+
+	/*
+	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
+	 * because the memory has been already freed. Do not bother checking
+	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
+	 */
+	set_bit(MMF_OOM_SKIP, &mm->flags);
+
+	mmap_write_lock(mm);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8a70bca67c94..98dca2b42357 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -538,7 +538,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-- 
2.36.1.255.ge46751e96f-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-05-31 22:30 [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
@ 2022-05-31 22:31 ` Suren Baghdasaryan
  2022-08-22 22:21   ` Andrew Morton
  2022-06-01 21:36 ` [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap Andrew Morton
  1 sibling, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2022-05-31 22:31 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, linux-kselftest, kernel-team,
	surenb

With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is
now unused and can be removed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h            | 9 ---------
 include/linux/sched/coredump.h | 7 +++----
 mm/oom_kill.c                  | 4 +---
 3 files changed, 4 insertions(+), 16 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6cdde62b078b..7d0c9c48a0c5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,15 +77,6 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
-/*
- * Use this helper if tsk->mm != mm and the victim mm needs a special
- * handling. This is guaranteed to stay true after once set.
- */
-static inline bool mm_is_oom_victim(struct mm_struct *mm)
-{
-	return test_bit(MMF_OOM_VICTIM, &mm->flags);
-}
-
 /*
  * Checks whether a page fault on the given mm is still reliable.
  * This is no longer true if the oom reaper started to reap the
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d0a5be28b70..8270ad7ae14c 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -71,9 +71,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
-#define MMF_OOM_VICTIM		25	/* mm is the oom victim */
-#define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
-#define MMF_MULTIPROCESS	27	/* mm is shared between processes */
+#define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
+#define MMF_MULTIPROCESS	26	/* mm is shared between processes */
 /*
  * MMF_HAS_PINNED: Whether this mm has pinned any pages.  This can be either
  * replaced in the future by mm.pinned_vm when it becomes stable, or grow into
@@ -81,7 +80,7 @@ static inline int get_dumpable(struct mm_struct *mm)
  * pinned pages were unpinned later on, we'll still keep this bit set for the
  * lifecycle of this mm, just for simplicity.
  */
-#define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
+#define MMF_HAS_PINNED		27	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 98dca2b42357..c6c76c313b39 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -764,10 +764,8 @@ static void mark_oom_victim(struct task_struct *tsk)
 		return;
 
 	/* oom_mm is bound to the signal struct life time. */
-	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		mmgrab(tsk->signal->oom_mm);
-		set_bit(MMF_OOM_VICTIM, &mm->flags);
-	}
 
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
-- 
2.36.1.255.ge46751e96f-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-05-31 22:30 [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
  2022-05-31 22:31 ` [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
@ 2022-06-01 21:36 ` Andrew Morton
  2022-06-01 21:47   ` Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2022-06-01 21:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, linux-kselftest, kernel-team,
	Liam Howlett

On Tue, 31 May 2022 15:30:59 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> The primary reason to invoke the oom reaper from the exit_mmap path used
> to be a prevention of an excessive oom killing if the oom victim exit
> races with the oom reaper (see [1] for more details). The invocation has
> moved around since then because of the interaction with the munlock
> logic but the underlying reason has remained the same (see [2]).
> 
> Munlock code is no longer a problem since [3] and there shouldn't be
> any blocking operation before the memory is unmapped by exit_mmap so
> the oom reaper invocation can be dropped. The unmapping part can be done
> with the non-exclusive mmap_sem and the exclusive one is only required
> when page tables are freed.
> 
> Remove the oom_reaper from exit_mmap which will make the code easier to
> read. This is really unlikely to make any observable difference although
> some microbenchmarks could benefit from one less branch that needs to be
> evaluated even though it almost never is true.
> 
> [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> 

I've just reinstated the mapletree patchset so there are some
conflicting changes.

> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -bool __oom_reap_task_mm(struct mm_struct *mm);
> -
>  long oom_badness(struct task_struct *p,
>  		unsigned long totalpages);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2b9305ed0dda..b7918e6bb0db 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3110,30 +3110,13 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> -	if (unlikely(mm_is_oom_victim(mm))) {
> -		/*
> -		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration.  Taking mm->mmap_lock for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_lock is
> -		 * dropped.
> -		 *
> -		 * Nothing can be holding mm->mmap_lock here and the above call
> -		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> -		 * __oom_reap_task_mm() will not block.
> -		 */
> -		(void)__oom_reap_task_mm(mm);
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> -	}
> -
> -	mmap_write_lock(mm);
> +	mmap_read_lock(mm);

Unclear why this patch fiddles with the mm_struct locking in this
fashion - changelogging that would have been helpful.

But iirc mapletree wants to retain a write_lock here, so I ended up with

void exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;
	unsigned long nr_accounted = 0;
	MA_STATE(mas, &mm->mm_mt, 0, 0);
	int count = 0;

	/* mm's last user has gone, and its about to be pulled down */
	mmu_notifier_release(mm);

	mmap_write_lock(mm);
	arch_exit_mmap(mm);

	vma = mas_find(&mas, ULONG_MAX);
	if (!vma) {
		/* Can happen if dup_mmap() received an OOM */
		mmap_write_unlock(mm);
		return;
	}

	lru_add_drain();
	flush_cache_mm(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	/* update_hiwater_rss(mm) here? but nobody should be looking */
	/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);

	/*
	 * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
	 * because the memory has been already freed. Do not bother checking
	 * mm_is_oom_victim because setting a bit unconditionally is cheaper.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
		      USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	/*
	 * Walk the list again, actually closing and freeing it, with preemption
	 * enabled, without holding any MM locks besides the unreachable
	 * mmap_write_lock.
	 */
	do {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		remove_vma(vma);
		count++;
		cond_resched();
	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);

	BUG_ON(count != mm->map_count);

	trace_exit_mmap(mm);
	__mt_destroy(&mm->mm_mt);
	mm->mmap = NULL;
	mmap_write_unlock(mm);
	vm_unacct_memory(nr_accounted);
}


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-06-01 21:36 ` [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap Andrew Morton
@ 2022-06-01 21:47   ` Suren Baghdasaryan
  2022-06-01 21:50     ` Suren Baghdasaryan
                       ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2022-06-01 21:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, David Rientjes, Matthew Wilcox, Johannes Weiner,
	Roman Gushchin, Minchan Kim, Kirill A. Shutemov,
	Andrea Arcangeli, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Peter Xu, John Hubbard, shuah, LKML, linux-mm, linux-kselftest,
	kernel-team, Liam Howlett

On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 31 May 2022 15:30:59 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > The primary reason to invoke the oom reaper from the exit_mmap path used
> > to be a prevention of an excessive oom killing if the oom victim exit
> > races with the oom reaper (see [1] for more details). The invocation has
> > moved around since then because of the interaction with the munlock
> > logic but the underlying reason has remained the same (see [2]).
> >
> > Munlock code is no longer a problem since [3] and there shouldn't be
> > any blocking operation before the memory is unmapped by exit_mmap so
> > the oom reaper invocation can be dropped. The unmapping part can be done
> > with the non-exclusive mmap_sem and the exclusive one is only required
> > when page tables are freed.
> >
> > Remove the oom_reaper from exit_mmap which will make the code easier to
> > read. This is really unlikely to make any observable difference although
> > some microbenchmarks could benefit from one less branch that needs to be
> > evaluated even though it almost never is true.
> >
> > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> >
>
> I've just reinstated the mapletree patchset so there are some
> conflicting changes.
>
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> >       return 0;
> >  }
> >
> > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > -
> >  long oom_badness(struct task_struct *p,
> >               unsigned long totalpages);
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2b9305ed0dda..b7918e6bb0db 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3110,30 +3110,13 @@ void exit_mmap(struct mm_struct *mm)
> >       /* mm's last user has gone, and its about to be pulled down */
> >       mmu_notifier_release(mm);
> >
> > -     if (unlikely(mm_is_oom_victim(mm))) {
> > -             /*
> > -              * Manually reap the mm to free as much memory as possible.
> > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -              * reaper will not run on this mm again after mmap_lock is
> > -              * dropped.
> > -              *
> > -              * Nothing can be holding mm->mmap_lock here and the above call
> > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > -              * __oom_reap_task_mm() will not block.
> > -              */
> > -             (void)__oom_reap_task_mm(mm);
> > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > -     }
> > -
> > -     mmap_write_lock(mm);
> > +     mmap_read_lock(mm);
>
> Unclear why this patch fiddles with the mm_struct locking in this
> fashion - changelogging that would have been helpful.

Yeah, I should have clarified this in the description. Everything up
to unmap_vmas() can be done under mmap_read_lock and that way
oom-reaper and process_mrelease can do the unmapping in parallel with
exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
free_pgtables. I think maple trees do not change that except there is
no mm->mmap anymore, so the line at the end of exit_mmap where we
reset mm->mmap to NULL can be removed (I show that line below).

>
> But iirc mapletree wants to retain a write_lock here, so I ended up with
>
> void exit_mmap(struct mm_struct *mm)
> {
>         struct mmu_gather tlb;
>         struct vm_area_struct *vma;
>         unsigned long nr_accounted = 0;
>         MA_STATE(mas, &mm->mm_mt, 0, 0);
>         int count = 0;
>
>         /* mm's last user has gone, and its about to be pulled down */
>         mmu_notifier_release(mm);
>
>         mmap_write_lock(mm);
>         arch_exit_mmap(mm);
>
>         vma = mas_find(&mas, ULONG_MAX);
>         if (!vma) {
>                 /* Can happen if dup_mmap() received an OOM */
>                 mmap_write_unlock(mm);
>                 return;
>         }
>
>         lru_add_drain();
>         flush_cache_mm(mm);
>         tlb_gather_mmu_fullmm(&tlb, mm);
>         /* update_hiwater_rss(mm) here? but nobody should be looking */
>         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
>         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
>
>         /*
>          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
>          * because the memory has been already freed. Do not bother checking
>          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
>          */
>         set_bit(MMF_OOM_SKIP, &mm->flags);
>         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
>                       USER_PGTABLES_CEILING);
>         tlb_finish_mmu(&tlb);
>
>         /*
>          * Walk the list again, actually closing and freeing it, with preemption
>          * enabled, without holding any MM locks besides the unreachable
>          * mmap_write_lock.
>          */
>         do {
>                 if (vma->vm_flags & VM_ACCOUNT)
>                         nr_accounted += vma_pages(vma);
>                 remove_vma(vma);
>                 count++;
>                 cond_resched();
>         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>
>         BUG_ON(count != mm->map_count);
>
>         trace_exit_mmap(mm);
>         __mt_destroy(&mm->mm_mt);
>         mm->mmap = NULL;

^^^ this line above needs to be removed when the patch is applied over
the maple tree patchset.


>         mmap_write_unlock(mm);
>         vm_unacct_memory(nr_accounted);
> }
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-06-01 21:47   ` Suren Baghdasaryan
@ 2022-06-01 21:50     ` Suren Baghdasaryan
  2022-06-02  6:53     ` Michal Hocko
  2022-06-02 13:39     ` Matthew Wilcox
  2 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2022-06-01 21:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, David Rientjes, Matthew Wilcox, Johannes Weiner,
	Roman Gushchin, Minchan Kim, Kirill A. Shutemov,
	Andrea Arcangeli, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Peter Xu, John Hubbard, shuah, LKML, linux-mm, linux-kselftest,
	kernel-team, Liam Howlett

On Wed, Jun 1, 2022 at 2:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 31 May 2022 15:30:59 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > The primary reason to invoke the oom reaper from the exit_mmap path used
> > > to be a prevention of an excessive oom killing if the oom victim exit
> > > races with the oom reaper (see [1] for more details). The invocation has
> > > moved around since then because of the interaction with the munlock
> > > logic but the underlying reason has remained the same (see [2]).
> > >
> > > Munlock code is no longer a problem since [3] and there shouldn't be
> > > any blocking operation before the memory is unmapped by exit_mmap so
> > > the oom reaper invocation can be dropped. The unmapping part can be done
> > > with the non-exclusive mmap_sem and the exclusive one is only required
> > > when page tables are freed.
> > >
> > > Remove the oom_reaper from exit_mmap which will make the code easier to
> > > read. This is really unlikely to make any observable difference although
> > > some microbenchmarks could benefit from one less branch that needs to be
> > > evaluated even though it almost never is true.
> > >
> > > [1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> > > [2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
> > >
> >
> > I've just reinstated the mapletree patchset so there are some
> > conflicting changes.
> >
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -106,8 +106,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> > >       return 0;
> > >  }
> > >
> > > -bool __oom_reap_task_mm(struct mm_struct *mm);
> > > -
> > >  long oom_badness(struct task_struct *p,
> > >               unsigned long totalpages);
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2b9305ed0dda..b7918e6bb0db 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -3110,30 +3110,13 @@ void exit_mmap(struct mm_struct *mm)
> > >       /* mm's last user has gone, and its about to be pulled down */
> > >       mmu_notifier_release(mm);
> > >
> > > -     if (unlikely(mm_is_oom_victim(mm))) {
> > > -             /*
> > > -              * Manually reap the mm to free as much memory as possible.
> > > -              * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > > -              * this mm from further consideration.  Taking mm->mmap_lock for
> > > -              * write after setting MMF_OOM_SKIP will guarantee that the oom
> > > -              * reaper will not run on this mm again after mmap_lock is
> > > -              * dropped.
> > > -              *
> > > -              * Nothing can be holding mm->mmap_lock here and the above call
> > > -              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> > > -              * __oom_reap_task_mm() will not block.
> > > -              */
> > > -             (void)__oom_reap_task_mm(mm);
> > > -             set_bit(MMF_OOM_SKIP, &mm->flags);
> > > -     }
> > > -
> > > -     mmap_write_lock(mm);
> > > +     mmap_read_lock(mm);
> >
> > Unclear why this patch fiddles with the mm_struct locking in this
> > fashion - changelogging that would have been helpful.
>
> Yeah, I should have clarified this in the description. Everything up
> to unmap_vmas() can be done under mmap_read_lock and that way
> oom-reaper and process_mrelease can do the unmapping in parallel with
> exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
> mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
> free_pgtables. I think maple trees do not change that except there is
> no mm->mmap anymore, so the line at the end of exit_mmap where we
> reset mm->mmap to NULL can be removed (I show that line below).

In the current changelog I have this explanation:

"The unmapping part can be done with the non-exclusive mmap_sem and
the exclusive one is only required when page tables are freed."

should I resend a v3 with a more detailed explanation for these
mmap_lock manipulations?

>
> >
> > But iirc mapletree wants to retain a write_lock here, so I ended up with
> >
> > void exit_mmap(struct mm_struct *mm)
> > {
> >         struct mmu_gather tlb;
> >         struct vm_area_struct *vma;
> >         unsigned long nr_accounted = 0;
> >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> >         int count = 0;
> >
> >         /* mm's last user has gone, and its about to be pulled down */
> >         mmu_notifier_release(mm);
> >
> >         mmap_write_lock(mm);
> >         arch_exit_mmap(mm);
> >
> >         vma = mas_find(&mas, ULONG_MAX);
> >         if (!vma) {
> >                 /* Can happen if dup_mmap() received an OOM */
> >                 mmap_write_unlock(mm);
> >                 return;
> >         }
> >
> >         lru_add_drain();
> >         flush_cache_mm(mm);
> >         tlb_gather_mmu_fullmm(&tlb, mm);
> >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> >
> >         /*
> >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> >          * because the memory has been already freed. Do not bother checking
> >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> >          */
> >         set_bit(MMF_OOM_SKIP, &mm->flags);
> >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> >                       USER_PGTABLES_CEILING);
> >         tlb_finish_mmu(&tlb);
> >
> >         /*
> >          * Walk the list again, actually closing and freeing it, with preemption
> >          * enabled, without holding any MM locks besides the unreachable
> >          * mmap_write_lock.
> >          */
> >         do {
> >                 if (vma->vm_flags & VM_ACCOUNT)
> >                         nr_accounted += vma_pages(vma);
> >                 remove_vma(vma);
> >                 count++;
> >                 cond_resched();
> >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> >
> >         BUG_ON(count != mm->map_count);
> >
> >         trace_exit_mmap(mm);
> >         __mt_destroy(&mm->mm_mt);
> >         mm->mmap = NULL;
>
> ^^^ this line above needs to be removed when the patch is applied over
> the maple tree patchset.
>
>
> >         mmap_write_unlock(mm);
> >         vm_unacct_memory(nr_accounted);
> > }
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-06-01 21:47   ` Suren Baghdasaryan
  2022-06-01 21:50     ` Suren Baghdasaryan
@ 2022-06-02  6:53     ` Michal Hocko
  2022-06-02 13:31       ` Liam Howlett
  2022-06-02 13:39     ` Matthew Wilcox
  2 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2022-06-02  6:53 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, David Rientjes, Matthew Wilcox, Johannes Weiner,
	Roman Gushchin, Minchan Kim, Kirill A. Shutemov,
	Andrea Arcangeli, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Peter Xu, John Hubbard, shuah, LKML, linux-mm, linux-kselftest,
	kernel-team, Liam Howlett

On Wed 01-06-22 14:47:41, Suren Baghdasaryan wrote:
> On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
[...]
> > But iirc mapletree wants to retain a write_lock here, so I ended up with
> >
> > void exit_mmap(struct mm_struct *mm)
> > {
> >         struct mmu_gather tlb;
> >         struct vm_area_struct *vma;
> >         unsigned long nr_accounted = 0;
> >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> >         int count = 0;
> >
> >         /* mm's last user has gone, and its about to be pulled down */
> >         mmu_notifier_release(mm);
> >
> >         mmap_write_lock(mm);
> >         arch_exit_mmap(mm);
> >
> >         vma = mas_find(&mas, ULONG_MAX);
> >         if (!vma) {
> >                 /* Can happen if dup_mmap() received an OOM */
> >                 mmap_write_unlock(mm);
> >                 return;
> >         }
> >
> >         lru_add_drain();
> >         flush_cache_mm(mm);
> >         tlb_gather_mmu_fullmm(&tlb, mm);
> >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> >
> >         /*
> >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> >          * because the memory has been already freed. Do not bother checking
> >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> >          */
> >         set_bit(MMF_OOM_SKIP, &mm->flags);
> >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> >                       USER_PGTABLES_CEILING);
> >         tlb_finish_mmu(&tlb);
> >
> >         /*
> >          * Walk the list again, actually closing and freeing it, with preemption
> >          * enabled, without holding any MM locks besides the unreachable
> >          * mmap_write_lock.
> >          */
> >         do {
> >                 if (vma->vm_flags & VM_ACCOUNT)
> >                         nr_accounted += vma_pages(vma);
> >                 remove_vma(vma);
> >                 count++;
> >                 cond_resched();
> >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> >
> >         BUG_ON(count != mm->map_count);
> >
> >         trace_exit_mmap(mm);
> >         __mt_destroy(&mm->mm_mt);
> >         mm->mmap = NULL;
> 
> ^^^ this line above needs to be removed when the patch is applied over
> the maple tree patchset.

I am not fully up to date on the maple tree changes. Could you explain
why resetting mm->mmap is not needed anymore please?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-06-02  6:53     ` Michal Hocko
@ 2022-06-02 13:31       ` Liam Howlett
  2022-06-02 14:08         ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Liam Howlett @ 2022-06-02 13:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A. Shutemov, Andrea Arcangeli, Christian Brauner,
	Christoph Hellwig, Oleg Nesterov, David Hildenbrand, Jann Horn,
	Shakeel Butt, Peter Xu, John Hubbard, shuah, LKML, linux-mm,
	linux-kselftest, kernel-team

* Michal Hocko <mhocko@suse.com> [220602 02:53]:
> On Wed 01-06-22 14:47:41, Suren Baghdasaryan wrote:
> > On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> [...]
> > > But iirc mapletree wants to retain a write_lock here, so I ended up with
> > >
> > > void exit_mmap(struct mm_struct *mm)
> > > {
> > >         struct mmu_gather tlb;
> > >         struct vm_area_struct *vma;
> > >         unsigned long nr_accounted = 0;
> > >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> > >         int count = 0;
> > >
> > >         /* mm's last user has gone, and its about to be pulled down */
> > >         mmu_notifier_release(mm);
> > >
> > >         mmap_write_lock(mm);
> > >         arch_exit_mmap(mm);
> > >
> > >         vma = mas_find(&mas, ULONG_MAX);
> > >         if (!vma) {
> > >                 /* Can happen if dup_mmap() received an OOM */
> > >                 mmap_write_unlock(mm);
> > >                 return;
> > >         }
> > >
> > >         lru_add_drain();
> > >         flush_cache_mm(mm);
> > >         tlb_gather_mmu_fullmm(&tlb, mm);
> > >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> > >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> > >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> > >
> > >         /*
> > >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > >          * because the memory has been already freed. Do not bother checking
> > >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > >          */
> > >         set_bit(MMF_OOM_SKIP, &mm->flags);
> > >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> > >                       USER_PGTABLES_CEILING);
> > >         tlb_finish_mmu(&tlb);
> > >
> > >         /*
> > >          * Walk the list again, actually closing and freeing it, with preemption
> > >          * enabled, without holding any MM locks besides the unreachable
> > >          * mmap_write_lock.
> > >          */
> > >         do {
> > >                 if (vma->vm_flags & VM_ACCOUNT)
> > >                         nr_accounted += vma_pages(vma);
> > >                 remove_vma(vma);
> > >                 count++;
> > >                 cond_resched();
> > >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > >
> > >         BUG_ON(count != mm->map_count);
> > >
> > >         trace_exit_mmap(mm);
> > >         __mt_destroy(&mm->mm_mt);
> > >         mm->mmap = NULL;
> > 
> > ^^^ this line above needs to be removed when the patch is applied over
> > the maple tree patchset.
> 
> I am not fully up to date on the maple tree changes. Could you explain
> why resetting mm->mmap is not needed anymore please?

The maple tree patch set removes the linked list, including mm->mmap.
The call to __mt_destroy() means none of the old VMAs can be found in
the race condition that mm->mmap = NULL was solving.


Thanks,
Liam

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-06-01 21:47   ` Suren Baghdasaryan
  2022-06-01 21:50     ` Suren Baghdasaryan
  2022-06-02  6:53     ` Michal Hocko
@ 2022-06-02 13:39     ` Matthew Wilcox
  2022-06-02 15:02       ` Suren Baghdasaryan
  2 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-06-02 13:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, David Rientjes, Johannes Weiner,
	Roman Gushchin, Minchan Kim, Kirill A. Shutemov,
	Andrea Arcangeli, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Peter Xu, John Hubbard, shuah, LKML, linux-mm, linux-kselftest,
	kernel-team, Liam Howlett

On Wed, Jun 01, 2022 at 02:47:41PM -0700, Suren Baghdasaryan wrote:
> > Unclear why this patch fiddles with the mm_struct locking in this
> > fashion - changelogging that would have been helpful.
> 
> Yeah, I should have clarified this in the description. Everything up
> to unmap_vmas() can be done under mmap_read_lock and that way
> oom-reaper and process_mrelease can do the unmapping in parallel with
> exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
> mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
> free_pgtables. I think maple trees do not change that except there is
> no mm->mmap anymore, so the line at the end of exit_mmap where we
> reset mm->mmap to NULL can be removed (I show that line below).

I don't understand why we _want_ unmapping to proceed in parallel?  Is it
so urgent to unmap these page tables that we need two processes doing
it at the same time?  And doesn't that just change the contention from
visible (contention on a lock) to invisible (contention on cachelines)?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-06-02 13:31       ` Liam Howlett
@ 2022-06-02 14:08         ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2022-06-02 14:08 UTC (permalink / raw)
  To: Liam Howlett
  Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A. Shutemov, Andrea Arcangeli, Christian Brauner,
	Christoph Hellwig, Oleg Nesterov, David Hildenbrand, Jann Horn,
	Shakeel Butt, Peter Xu, John Hubbard, shuah, LKML, linux-mm,
	linux-kselftest, kernel-team

On Thu 02-06-22 13:31:27, Liam Howlett wrote:
> * Michal Hocko <mhocko@suse.com> [220602 02:53]:
> > On Wed 01-06-22 14:47:41, Suren Baghdasaryan wrote:
> > > On Wed, Jun 1, 2022 at 2:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > [...]
> > > > But iirc mapletree wants to retain a write_lock here, so I ended up with
> > > >
> > > > void exit_mmap(struct mm_struct *mm)
> > > > {
> > > >         struct mmu_gather tlb;
> > > >         struct vm_area_struct *vma;
> > > >         unsigned long nr_accounted = 0;
> > > >         MA_STATE(mas, &mm->mm_mt, 0, 0);
> > > >         int count = 0;
> > > >
> > > >         /* mm's last user has gone, and its about to be pulled down */
> > > >         mmu_notifier_release(mm);
> > > >
> > > >         mmap_write_lock(mm);
> > > >         arch_exit_mmap(mm);
> > > >
> > > >         vma = mas_find(&mas, ULONG_MAX);
> > > >         if (!vma) {
> > > >                 /* Can happen if dup_mmap() received an OOM */
> > > >                 mmap_write_unlock(mm);
> > > >                 return;
> > > >         }
> > > >
> > > >         lru_add_drain();
> > > >         flush_cache_mm(mm);
> > > >         tlb_gather_mmu_fullmm(&tlb, mm);
> > > >         /* update_hiwater_rss(mm) here? but nobody should be looking */
> > > >         /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> > > >         unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> > > >
> > > >         /*
> > > >          * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
> > > >          * because the memory has been already freed. Do not bother checking
> > > >          * mm_is_oom_victim because setting a bit unconditionally is cheaper.
> > > >          */
> > > >         set_bit(MMF_OOM_SKIP, &mm->flags);
> > > >         free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
> > > >                       USER_PGTABLES_CEILING);
> > > >         tlb_finish_mmu(&tlb);
> > > >
> > > >         /*
> > > >          * Walk the list again, actually closing and freeing it, with preemption
> > > >          * enabled, without holding any MM locks besides the unreachable
> > > >          * mmap_write_lock.
> > > >          */
> > > >         do {
> > > >                 if (vma->vm_flags & VM_ACCOUNT)
> > > >                         nr_accounted += vma_pages(vma);
> > > >                 remove_vma(vma);
> > > >                 count++;
> > > >                 cond_resched();
> > > >         } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > > >
> > > >         BUG_ON(count != mm->map_count);
> > > >
> > > >         trace_exit_mmap(mm);
> > > >         __mt_destroy(&mm->mm_mt);
> > > >         mm->mmap = NULL;
> > > 
> > > ^^^ this line above needs to be removed when the patch is applied over
> > > the maple tree patchset.
> > 
> > I am not fully up to date on the maple tree changes. Could you explain
> > why resetting mm->mmap is not needed anymore please?
> 
> The maple tree patch set removes the linked list, including mm->mmap.
> The call to __mt_destroy() means none of the old VMAs can be found in
> the race condition that mm->mmap = NULL was solving.

Thanks for the clarification, Liam.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap
  2022-06-02 13:39     ` Matthew Wilcox
@ 2022-06-02 15:02       ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2022-06-02 15:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Michal Hocko, David Rientjes, Johannes Weiner,
	Roman Gushchin, Minchan Kim, Kirill A. Shutemov,
	Andrea Arcangeli, Christian Brauner, Christoph Hellwig,
	Oleg Nesterov, David Hildenbrand, Jann Horn, Shakeel Butt,
	Peter Xu, John Hubbard, shuah, LKML, linux-mm, linux-kselftest,
	kernel-team, Liam Howlett

On Thu, Jun 2, 2022 at 6:39 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jun 01, 2022 at 02:47:41PM -0700, Suren Baghdasaryan wrote:
> > > Unclear why this patch fiddles with the mm_struct locking in this
> > > fashion - changelogging that would have been helpful.
> >
> > Yeah, I should have clarified this in the description. Everything up
> > to unmap_vmas() can be done under mmap_read_lock and that way
> > oom-reaper and process_mrelease can do the unmapping in parallel with
> > exit_mmap. That's the reason we take mmap_read_lock, unmap the vmas,
> > mark the mm with MMF_OOM_SKIP and take the mmap_write_lock to execute
> > free_pgtables. I think maple trees do not change that except there is
> > no mm->mmap anymore, so the line at the end of exit_mmap where we
> > reset mm->mmap to NULL can be removed (I show that line below).
>
> I don't understand why we _want_ unmapping to proceed in parallel?  Is it
> so urgent to unmap these page tables that we need two processes doing
> it at the same time?  And doesn't that just change the contention from
> visible (contention on a lock) to invisible (contention on cachelines)?

It's important for process_madvise() syscall not to be blocked by a
potentially lower priority task doing exit_mmap. I've seen such
priority inversion happening when the dying process is running on a
little core taking its time while a high-priority task is waiting in
the syscall while there is no reason for them to block each other.

>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-05-31 22:31 ` [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
@ 2022-08-22 22:21   ` Andrew Morton
  2022-08-22 22:33     ` Yu Zhao
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2022-08-22 22:21 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: mhocko, rientjes, willy, hannes, guro, minchan, kirill, aarcange,
	brauner, hch, oleg, david, jannh, shakeelb, peterx, jhubbard,
	shuah, linux-kernel, linux-mm, linux-kselftest, kernel-team,
	Yu Zhao

On Tue, 31 May 2022 15:31:00 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is
> now unused and can be removed.
> 
> ...
>
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -77,15 +77,6 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
>  	return tsk->signal->oom_mm;
>  }
>  
> -/*
> - * Use this helper if tsk->mm != mm and the victim mm needs a special
> - * handling. This is guaranteed to stay true after once set.
> - */
> -static inline bool mm_is_oom_victim(struct mm_struct *mm)
> -{
> -	return test_bit(MMF_OOM_VICTIM, &mm->flags);
> -}
> -

The patch "mm: multi-gen LRU: support page table walks" from the MGLRU
series
(https://lkml.kernel.org/r/20220815071332.627393-9-yuzhao@google.com)
adds two calls to mm_is_oom_victim(), so my build broke.

I assume the fix is simply

--- a/mm/vmscan.c~mm-delete-unused-mmf_oom_victim-flag-fix
+++ a/mm/vmscan.c
@@ -3429,9 +3429,6 @@ static bool should_skip_mm(struct mm_str
 	if (size < MIN_LRU_BATCH)
 		return true;
 
-	if (mm_is_oom_victim(mm))
-		return true;
-
 	return !mmget_not_zero(mm);
 }
 
@@ -4127,9 +4124,6 @@ restart:
 
 		walk_pmd_range(&val, addr, next, args);
 
-		if (mm_is_oom_victim(args->mm))
-			return 1;
-
 		/* a racy check to curtail the waiting time */
 		if (wq_has_sleeper(&walk->lruvec->mm_state.wait))
 			return 1;
_

Please confirm?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-22 22:21   ` Andrew Morton
@ 2022-08-22 22:33     ` Yu Zhao
  2022-08-22 22:48       ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Zhao @ 2022-08-22 22:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner, hch, oleg,
	David Hildenbrand, Jann Horn, Shakeel Butt, Peter Xu,
	John Hubbard, shuah, linux-kernel, Linux-MM, linux-kselftest,
	kernel-team

On Mon, Aug 22, 2022 at 4:21 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 31 May 2022 15:31:00 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is
> > now unused and can be removed.
> >
> > ...
> >
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -77,15 +77,6 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
> >       return tsk->signal->oom_mm;
> >  }
> >
> > -/*
> > - * Use this helper if tsk->mm != mm and the victim mm needs a special
> > - * handling. This is guaranteed to stay true after once set.
> > - */
> > -static inline bool mm_is_oom_victim(struct mm_struct *mm)
> > -{
> > -     return test_bit(MMF_OOM_VICTIM, &mm->flags);
> > -}
> > -
>
> The patch "mm: multi-gen LRU: support page table walks" from the MGLRU
> series
> (https://lkml.kernel.org/r/20220815071332.627393-9-yuzhao@google.com)
> adds two calls to mm_is_oom_victim(), so my build broke.
>
> I assume the fix is simply
>
> --- a/mm/vmscan.c~mm-delete-unused-mmf_oom_victim-flag-fix
> +++ a/mm/vmscan.c
> @@ -3429,9 +3429,6 @@ static bool should_skip_mm(struct mm_str
>         if (size < MIN_LRU_BATCH)
>                 return true;
>
> -       if (mm_is_oom_victim(mm))
> -               return true;
> -
>         return !mmget_not_zero(mm);
>  }
>
> @@ -4127,9 +4124,6 @@ restart:
>
>                 walk_pmd_range(&val, addr, next, args);
>
> -               if (mm_is_oom_victim(args->mm))
> -                       return 1;
> -
>                 /* a racy check to curtail the waiting time */
>                 if (wq_has_sleeper(&walk->lruvec->mm_state.wait))
>                         return 1;
> _
>
> Please confirm?

LGTM.  The deleted checks are not about correctness.

I've queued

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3402,7 +3402,7 @@ static bool should_skip_mm(struct mm_struct *mm,
struct lru_gen_mm_walk *walk)
        if (size < MIN_LRU_BATCH)
                return true;

-       if (mm_is_oom_victim(mm))
+       if (test_bit(MMF_OOM_REAP_QUEUED, &mm->flags))
                return true;

        return !mmget_not_zero(mm);
@@ -4109,7 +4109,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned
long start, unsigned long end,

                walk_pmd_range(&val, addr, next, args);

-               if (mm_is_oom_victim(args->mm))
+               if (test_bit(MMF_OOM_REAP_QUEUED, &args->mm->flags))
                        return 1;

                /* a racy check to curtail the waiting time */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-22 22:33     ` Yu Zhao
@ 2022-08-22 22:48       ` Andrew Morton
  2022-08-22 22:59         ` Yu Zhao
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2022-08-22 22:48 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Suren Baghdasaryan, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner, hch, oleg,
	David Hildenbrand, Jann Horn, Shakeel Butt, Peter Xu,
	John Hubbard, shuah, linux-kernel, Linux-MM, linux-kselftest,
	kernel-team

On Mon, 22 Aug 2022 16:33:51 -0600 Yu Zhao <yuzhao@google.com> wrote:

> > --- a/mm/vmscan.c~mm-delete-unused-mmf_oom_victim-flag-fix
> > +++ a/mm/vmscan.c
> > @@ -3429,9 +3429,6 @@ static bool should_skip_mm(struct mm_str
> >         if (size < MIN_LRU_BATCH)
> >                 return true;
> >
> > -       if (mm_is_oom_victim(mm))
> > -               return true;
> > -
> >         return !mmget_not_zero(mm);
> >  }
> >
> > @@ -4127,9 +4124,6 @@ restart:
> >
> >                 walk_pmd_range(&val, addr, next, args);
> >
> > -               if (mm_is_oom_victim(args->mm))
> > -                       return 1;
> > -
> >                 /* a racy check to curtail the waiting time */
> >                 if (wq_has_sleeper(&walk->lruvec->mm_state.wait))
> >                         return 1;
> > _
> >
> > Please confirm?
> 
> LGTM.  The deleted checks are not about correctness.

OK, for now.

> I've queued
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3402,7 +3402,7 @@ static bool should_skip_mm(struct mm_struct *mm,
> struct lru_gen_mm_walk *walk)
>         if (size < MIN_LRU_BATCH)
>                 return true;
> 
> -       if (mm_is_oom_victim(mm))
> +       if (test_bit(MMF_OOM_REAP_QUEUED, &mm->flags))
>                 return true;
> 
>         return !mmget_not_zero(mm);
> @@ -4109,7 +4109,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned
> long start, unsigned long end,
> 
>                 walk_pmd_range(&val, addr, next, args);
> 
> -               if (mm_is_oom_victim(args->mm))
> +               if (test_bit(MMF_OOM_REAP_QUEUED, &args->mm->flags))
>                         return 1;
> 
>                 /* a racy check to curtail the waiting time */

Oh.  Why?  What does this change do?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-22 22:48       ` Andrew Morton
@ 2022-08-22 22:59         ` Yu Zhao
  2022-08-22 23:16           ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Zhao @ 2022-08-22 22:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner, hch, oleg,
	David Hildenbrand, Jann Horn, Shakeel Butt, Peter Xu,
	John Hubbard, shuah, linux-kernel, Linux-MM, linux-kselftest,
	kernel-team

On Mon, Aug 22, 2022 at 4:48 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 22 Aug 2022 16:33:51 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > > --- a/mm/vmscan.c~mm-delete-unused-mmf_oom_victim-flag-fix
> > > +++ a/mm/vmscan.c
> > > @@ -3429,9 +3429,6 @@ static bool should_skip_mm(struct mm_str
> > >         if (size < MIN_LRU_BATCH)
> > >                 return true;
> > >
> > > -       if (mm_is_oom_victim(mm))
> > > -               return true;
> > > -
> > >         return !mmget_not_zero(mm);
> > >  }
> > >
> > > @@ -4127,9 +4124,6 @@ restart:
> > >
> > >                 walk_pmd_range(&val, addr, next, args);
> > >
> > > -               if (mm_is_oom_victim(args->mm))
> > > -                       return 1;
> > > -
> > >                 /* a racy check to curtail the waiting time */
> > >                 if (wq_has_sleeper(&walk->lruvec->mm_state.wait))
> > >                         return 1;
> > > _
> > >
> > > Please confirm?
> >
> > LGTM.  The deleted checks are not about correctness.
>
> OK, for now.
>
> > I've queued
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3402,7 +3402,7 @@ static bool should_skip_mm(struct mm_struct *mm,
> > struct lru_gen_mm_walk *walk)
> >         if (size < MIN_LRU_BATCH)
> >                 return true;
> >
> > -       if (mm_is_oom_victim(mm))
> > +       if (test_bit(MMF_OOM_REAP_QUEUED, &mm->flags))
> >                 return true;
> >
> >         return !mmget_not_zero(mm);
> > @@ -4109,7 +4109,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned
> > long start, unsigned long end,
> >
> >                 walk_pmd_range(&val, addr, next, args);
> >
> > -               if (mm_is_oom_victim(args->mm))
> > +               if (test_bit(MMF_OOM_REAP_QUEUED, &args->mm->flags))
> >                         return 1;
> >
> >                 /* a racy check to curtail the waiting time */
>
> Oh.  Why?  What does this change do?

The MMF_OOM_REAP_QUEUED flag is similar to the deleted MMF_OOM_VICTIM
flag, but it's set at a later stage during an OOM kill.

When either is set, the OOM reaper is probably already freeing the
memory of this mm_struct, or at least it's going to. So there is no
need to dwell on it in the reclaim path, hence not about correctness.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-22 22:59         ` Yu Zhao
@ 2022-08-22 23:16           ` Andrew Morton
  2022-08-22 23:20             ` Yu Zhao
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2022-08-22 23:16 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Suren Baghdasaryan, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner, hch, oleg,
	David Hildenbrand, Jann Horn, Shakeel Butt, Peter Xu,
	John Hubbard, shuah, linux-kernel, Linux-MM, linux-kselftest,
	kernel-team

On Mon, 22 Aug 2022 16:59:29 -0600 Yu Zhao <yuzhao@google.com> wrote:

> > > @@ -4109,7 +4109,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned
> > > long start, unsigned long end,
> > >
> > >                 walk_pmd_range(&val, addr, next, args);
> > >
> > > -               if (mm_is_oom_victim(args->mm))
> > > +               if (test_bit(MMF_OOM_REAP_QUEUED, &args->mm->flags))
> > >                         return 1;
> > >
> > >                 /* a racy check to curtail the waiting time */
> >
> > Oh.  Why?  What does this change do?
> 
> The MMF_OOM_REAP_QUEUED flag is similar to the deleted MMF_OOM_VICTIM
> flag, but it's set at a later stage during an OOM kill.
> 
> When either is set, the OOM reaper is probably already freeing the
> memory of this mm_struct, or at least it's going to. So there is no
> need to dwell on it in the reclaim path, hence not about correctness.

Thanks.  That sounds worthy of some code comments?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-22 23:16           ` Andrew Morton
@ 2022-08-22 23:20             ` Yu Zhao
  2022-08-23  8:36               ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Zhao @ 2022-08-22 23:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner,
	Christoph Hellwig, oleg, David Hildenbrand, Jann Horn,
	Shakeel Butt, Peter Xu, John Hubbard, shuah, linux-kernel,
	Linux-MM, linux-kselftest, kernel-team

On Mon, Aug 22, 2022 at 5:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 22 Aug 2022 16:59:29 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > > > @@ -4109,7 +4109,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned
> > > > long start, unsigned long end,
> > > >
> > > >                 walk_pmd_range(&val, addr, next, args);
> > > >
> > > > -               if (mm_is_oom_victim(args->mm))
> > > > +               if (test_bit(MMF_OOM_REAP_QUEUED, &args->mm->flags))
> > > >                         return 1;
> > > >
> > > >                 /* a racy check to curtail the waiting time */
> > >
> > > Oh.  Why?  What does this change do?
> >
> > The MMF_OOM_REAP_QUEUED flag is similar to the deleted MMF_OOM_VICTIM
> > flag, but it's set at a later stage during an OOM kill.
> >
> > When either is set, the OOM reaper is probably already freeing the
> > memory of this mm_struct, or at least it's going to. So there is no
> > need to dwell on it in the reclaim path, hence not about correctness.
>
> Thanks.  That sounds worthy of some code comments?

Will do. Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-22 23:20             ` Yu Zhao
@ 2022-08-23  8:36               ` Michal Hocko
  2022-08-28 19:50                 ` Yu Zhao
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2022-08-23  8:36 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Suren Baghdasaryan, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner,
	Christoph Hellwig, oleg, David Hildenbrand, Jann Horn,
	Shakeel Butt, Peter Xu, John Hubbard, shuah, linux-kernel,
	Linux-MM, linux-kselftest, kernel-team

On Mon 22-08-22 17:20:17, Yu Zhao wrote:
> On Mon, Aug 22, 2022 at 5:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Mon, 22 Aug 2022 16:59:29 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >
> > > > > @@ -4109,7 +4109,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned
> > > > > long start, unsigned long end,
> > > > >
> > > > >                 walk_pmd_range(&val, addr, next, args);
> > > > >
> > > > > -               if (mm_is_oom_victim(args->mm))
> > > > > +               if (test_bit(MMF_OOM_REAP_QUEUED, &args->mm->flags))
> > > > >                         return 1;
> > > > >
> > > > >                 /* a racy check to curtail the waiting time */
> > > >
> > > > Oh.  Why?  What does this change do?
> > >
> > > The MMF_OOM_REAP_QUEUED flag is similar to the deleted MMF_OOM_VICTIM
> > > flag, but it's set at a later stage during an OOM kill.
> > >
> > > When either is set, the OOM reaper is probably already freeing the
> > > memory of this mm_struct, or at least it's going to. So there is no
> > > need to dwell on it in the reclaim path, hence not about correctness.
> >
> > Thanks.  That sounds worthy of some code comments?
> 
> Will do. Thanks.

I would rather not see this abuse. You cannot really make any
assumptions about oom_reaper and how quickly it is going to free the
memory. If this is really worth it (and I have to say I doubt it) then
it should be a separate patch with numbers justifying it.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-23  8:36               ` Michal Hocko
@ 2022-08-28 19:50                 ` Yu Zhao
  2022-08-29 10:40                   ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Zhao @ 2022-08-28 19:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Suren Baghdasaryan, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner,
	Christoph Hellwig, oleg, David Hildenbrand, Jann Horn,
	Shakeel Butt, Peter Xu, John Hubbard, shuah, linux-kernel,
	Linux-MM, linux-kselftest, kernel-team

On Tue, Aug 23, 2022 at 2:36 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 17:20:17, Yu Zhao wrote:
> > On Mon, Aug 22, 2022 at 5:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Mon, 22 Aug 2022 16:59:29 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > > > > @@ -4109,7 +4109,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned
> > > > > > long start, unsigned long end,
> > > > > >
> > > > > >                 walk_pmd_range(&val, addr, next, args);
> > > > > >
> > > > > > -               if (mm_is_oom_victim(args->mm))
> > > > > > +               if (test_bit(MMF_OOM_REAP_QUEUED, &args->mm->flags))
> > > > > >                         return 1;
> > > > > >
> > > > > >                 /* a racy check to curtail the waiting time */
> > > > >
> > > > > Oh.  Why?  What does this change do?
> > > >
> > > > The MMF_OOM_REAP_QUEUED flag is similar to the deleted MMF_OOM_VICTIM
> > > > flag, but it's set at a later stage during an OOM kill.
> > > >
> > > > When either is set, the OOM reaper is probably already freeing the
> > > > memory of this mm_struct, or at least it's going to. So there is no
> > > > need to dwell on it in the reclaim path, hence not about correctness.
> > >
> > > Thanks.  That sounds worthy of some code comments?
> >
> > Will do. Thanks.
>
> I would rather not see this abuse.

I understand where you're coming from, however, I don't share this
POV. I see it as cooperation -- the page reclaim and the oom/reaper
can't (or at least shouldn't) operate in isolation.

> You cannot really make any
> assumptions about oom_reaper and how quickly it is going to free the
> memory.

Agreed. But here we are talking about heuristics, not dependencies on
certain behaviors. Assume we are playing a guessing game: there are
multiple mm_structs available for reclaim, would the oom-killed ones
be more profitable on average? I'd say no, because I assume it's more
likely than unlikely that the oom reaper is doing/to do its work. Note
that the assumption is about likelihood, hence arguably valid.

> If this is really worth it (and I have to say I doubt it) then
> it should be a separate patch with numbers justifying it.

I definitely can artificially create a test case that runs oom a few
times per second, to prove this two-liner is beneficial to that
scenario. Then there is the question how much it would benefit the
real-world scenarios.

I'd recommend keeping this two-liner if we still had
mm_is_oom_victim(), because it's simple, clear and intuitive. With
MMF_OOM_REAP_QUEUED, I don't have a strong opinion. Since you do, I'll
just delete it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-28 19:50                 ` Yu Zhao
@ 2022-08-29 10:40                   ` Michal Hocko
  2022-08-29 10:45                     ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2022-08-29 10:40 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Suren Baghdasaryan, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner,
	Christoph Hellwig, oleg, David Hildenbrand, Jann Horn,
	Shakeel Butt, Peter Xu, John Hubbard, shuah, linux-kernel,
	Linux-MM, linux-kselftest, kernel-team

On Sun 28-08-22 13:50:09, Yu Zhao wrote:
> On Tue, Aug 23, 2022 at 2:36 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > You cannot really make any
> > assumptions about oom_reaper and how quickly it is going to free the
> > memory.
> 
> Agreed. But here we are talking about heuristics, not dependencies on
> certain behaviors. Assume we are playing a guessing game: there are
> multiple mm_structs available for reclaim, would the oom-killed ones
> be more profitable on average? I'd say no, because I assume it's more
> likely than unlikely that the oom reaper is doing/to do its work. Note
> that the assumption is about likelihood, hence arguably valid.

Well, my main counter argument would be that we do not really want to
carve last resort mechanism (which the oom reaper is) into any heuristic
because any future changes into that mechanism will be much harder to
justify and change. There is a cost of the maintenance that should be
considered. While you might be right that this change would be
beneficial, there is no actual proof of that. Historically we've had
several examples of such a behavior which was really hard to change
later on because the effect would be really hard to evaluate.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag
  2022-08-29 10:40                   ` Michal Hocko
@ 2022-08-29 10:45                     ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2022-08-29 10:45 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Suren Baghdasaryan, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Minchan Kim,
	Kirill A . Shutemov, Andrea Arcangeli, brauner,
	Christoph Hellwig, oleg, David Hildenbrand, Jann Horn,
	Shakeel Butt, Peter Xu, John Hubbard, shuah, linux-kernel,
	Linux-MM, linux-kselftest, kernel-team

On Mon 29-08-22 12:40:05, Michal Hocko wrote:
> On Sun 28-08-22 13:50:09, Yu Zhao wrote:
> > On Tue, Aug 23, 2022 at 2:36 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > You cannot really make any
> > > assumptions about oom_reaper and how quickly it is going to free the
> > > memory.
> > 
> > Agreed. But here we are talking about heuristics, not dependencies on
> > certain behaviors. Assume we are playing a guessing game: there are
> > multiple mm_structs available for reclaim, would the oom-killed ones
> > be more profitable on average? I'd say no, because I assume it's more
> > likely than unlikely that the oom reaper is doing/to do its work. Note
> > that the assumption is about likelihood, hence arguably valid.
> 
> Well, my main counter argument would be that we do not really want to
> carve last resort mechanism (which the oom reaper is) into any heuristic
> because any future changes into that mechanism will be much harder to
> justify and change. There is a cost of the maintenance that should be
> considered. While you might be right that this change would be
> beneficial, there is no actual proof of that. Historically we've had
> several examples of such a behavior which was really hard to change
> later on because the effect would be really hard to evaluate.

Forgot to mention the recent change as a clear example of the change
which would be have a higher burden to evaluate. e4a38402c36e
("oom_kill.c: futex: delay the OOM reaper to allow time for proper futex
cleanup") has changed the wake up logic to be triggered after a timeout.
This means that the task will be sitting there on the queue without any
actual reclaim done on it. The timeout itself can be changed in the
future and I would really hate to argue that changeing it from $FOO to
$FOO + epsilon breaks a very subtle dependency somewhere deep in the
reclaim path. From the oom reaper POV any timeout is reasonable becaude
this is the _last_ resort to resolve OOM stall/deadlock when the victim
cannot exit on its own for whatever reason. This is a considerably
different objective from "we want to optimize which taks to scan to
reclaim efficiently".

See my point?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-08-29 10:45 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-31 22:30 [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap Suren Baghdasaryan
2022-05-31 22:31 ` [PATCH RESEND v2 2/2] mm: delete unused MMF_OOM_VICTIM flag Suren Baghdasaryan
2022-08-22 22:21   ` Andrew Morton
2022-08-22 22:33     ` Yu Zhao
2022-08-22 22:48       ` Andrew Morton
2022-08-22 22:59         ` Yu Zhao
2022-08-22 23:16           ` Andrew Morton
2022-08-22 23:20             ` Yu Zhao
2022-08-23  8:36               ` Michal Hocko
2022-08-28 19:50                 ` Yu Zhao
2022-08-29 10:40                   ` Michal Hocko
2022-08-29 10:45                     ` Michal Hocko
2022-06-01 21:36 ` [PATCH RESEND v2 1/2] mm: drop oom code from exit_mmap Andrew Morton
2022-06-01 21:47   ` Suren Baghdasaryan
2022-06-01 21:50     ` Suren Baghdasaryan
2022-06-02  6:53     ` Michal Hocko
2022-06-02 13:31       ` Liam Howlett
2022-06-02 14:08         ` Michal Hocko
2022-06-02 13:39     ` Matthew Wilcox
2022-06-02 15:02       ` Suren Baghdasaryan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.