All of lore.kernel.org
* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
@ 2016-05-20  1:30 ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2016-05-20  1:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, linux-mm, LKML,
	Michal Hocko

Forking a new thread because my comment is not related to this patch's
purpose; I just found something while reading the patch.

On Tue, Apr 26, 2016 at 04:04:30PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo has properly noted that mmput slow path might get blocked waiting
> for another party (e.g. exit_aio waits for an IO). If that happens the
> oom_reaper would be put out of the way and will not be able to process
> next oom victim. We should strive for making this context as reliable
> and independent on other subsystems as much as possible.
> 
> Introduce mmput_async which will perform the slow path from an async
> (WQ) context. This will delay the operation but that shouldn't be a
> problem because the oom_reaper has reclaimed the victim's address space
> for most cases as much as possible and the remaining context shouldn't
> bind too much memory anymore. The only exception is when mmap_sem
> trylock has failed which shouldn't happen too often.
> 
> The issue is only theoretical but not impossible.

mmput_async is used only by the OOM reaper, which is enabled only with CONFIG_MMU.
So until someone who wants to use mmput_async on !CONFIG_MMU comes along,
we could save sizeof(struct work_struct) per mm on !CONFIG_MMU.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-05-20  1:30 ` Minchan Kim
@ 2016-05-20  6:16   ` Michal Hocko
  -1 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2016-05-20  6:16 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, Tetsuo Handa, David Rientjes, linux-mm, LKML

On Fri 20-05-16 10:30:53, Minchan Kim wrote:
> Forking new thread because my comment is not related to this patch's
> purpose but found a thing during reading this patch.
> 
> On Tue, Apr 26, 2016 at 04:04:30PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Tetsuo has properly noted that mmput slow path might get blocked waiting
> > for another party (e.g. exit_aio waits for an IO). If that happens the
> > oom_reaper would be put out of the way and will not be able to process
> > next oom victim. We should strive for making this context as reliable
> > and independent on other subsystems as much as possible.
> > 
> > Introduce mmput_async which will perform the slow path from an async
> > (WQ) context. This will delay the operation but that shouldn't be a
> > problem because the oom_reaper has reclaimed the victim's address space
> > for most cases as much as possible and the remaining context shouldn't
> > bind too much memory anymore. The only exception is when mmap_sem
> > trylock has failed which shouldn't happen too often.
> > 
> > The issue is only theoretical but not impossible.
> 
> The mmput_async is used for only OOM reaper which is enabled on CONFIG_MMU.
> So until someone who want to use mmput_async in !CONFIG_MMU come out,
> we could save sizeof(struct work_struct) per mm in !CONFIG_MMU.

You are right. What about the following?
---
>From 8f8a34bf00882bfc0b557ed79e0e9e956ac9d217 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 20 May 2016 08:14:39 +0200
Subject: [PATCH] mmotm:
 mm-oom_reaper-do-not-mmput-synchronously-from-the-oom-reaper-context-fix

mmput_async is currently used only from the oom_reaper which is defined
only for CONFIG_MMU. We can save work_struct in mm_struct for
!CONFIG_MMU.

Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm_types.h | 2 ++
 include/linux/sched.h    | 2 ++
 kernel/fork.c            | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ab142ace96f3..a16dcb2efca4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -514,7 +514,9 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+#ifdef CONFIG_MMU
 	struct work_struct async_put_work;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index df8778e72211..11b31ded65cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2604,10 +2604,12 @@ static inline void mmdrop(struct mm_struct * mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
+#ifdef CONFIG_MMU
 /* same as above but performs the slow path from the async kontext. Can
  * be called from the atomic context as well
  */
 extern void mmput_async(struct mm_struct *);
+#endif
 
 /* Grab a reference to a task's mm, if it is not already going away */
 extern struct mm_struct *get_task_mm(struct task_struct *task);
diff --git a/kernel/fork.c b/kernel/fork.c
index e1dc6b02ac8b..1e3dc3af6845 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -732,6 +732,7 @@ void mmput(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mmput);
 
+#ifdef CONFIG_MMU
 static void mmput_async_fn(struct work_struct *work)
 {
 	struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
@@ -745,6 +746,7 @@ void mmput_async(struct mm_struct *mm)
 		schedule_work(&mm->async_put_work);
 	}
 }
+#endif
 
 /**
  * set_mm_exe_file - change a reference to the mm's executable file
-- 
2.8.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-05-20  6:16   ` Michal Hocko
@ 2016-05-20  7:12     ` Minchan Kim
  -1 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2016-05-20  7:12 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tetsuo Handa, David Rientjes, linux-mm, LKML

On Fri, May 20, 2016 at 08:16:59AM +0200, Michal Hocko wrote:
> On Fri 20-05-16 10:30:53, Minchan Kim wrote:
> > Forking new thread because my comment is not related to this patch's
> > purpose but found a thing during reading this patch.
> > 
> > On Tue, Apr 26, 2016 at 04:04:30PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > Tetsuo has properly noted that mmput slow path might get blocked waiting
> > > for another party (e.g. exit_aio waits for an IO). If that happens the
> > > oom_reaper would be put out of the way and will not be able to process
> > > next oom victim. We should strive for making this context as reliable
> > > and independent on other subsystems as much as possible.
> > > 
> > > Introduce mmput_async which will perform the slow path from an async
> > > (WQ) context. This will delay the operation but that shouldn't be a
> > > problem because the oom_reaper has reclaimed the victim's address space
> > > for most cases as much as possible and the remaining context shouldn't
> > > bind too much memory anymore. The only exception is when mmap_sem
> > > trylock has failed which shouldn't happen too often.
> > > 
> > > The issue is only theoretical but not impossible.
> > 
> > The mmput_async is used for only OOM reaper which is enabled on CONFIG_MMU.
> > So until someone who want to use mmput_async in !CONFIG_MMU come out,
> > we could save sizeof(struct work_struct) per mm in !CONFIG_MMU.
> 
> You are right. What about the following?
> ---
> From 8f8a34bf00882bfc0b557ed79e0e9e956ac9d217 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 20 May 2016 08:14:39 +0200
> Subject: [PATCH] mmotm:
>  mm-oom_reaper-do-not-mmput-synchronously-from-the-oom-reaper-context-fix
> 
> mmput_async is currently used only from the oom_reaper which is defined
> only for CONFIG_MMU. We can save work_struct in mm_struct for
> !CONFIG_MMU.
> 
> Reported-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>

I found a typo below, although it was not introduced by this patch.
My brand new glasses are really serving me well.

> ---
>  include/linux/mm_types.h | 2 ++
>  include/linux/sched.h    | 2 ++
>  kernel/fork.c            | 2 ++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index ab142ace96f3..a16dcb2efca4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -514,7 +514,9 @@ struct mm_struct {
>  #ifdef CONFIG_HUGETLB_PAGE
>  	atomic_long_t hugetlb_usage;
>  #endif
> +#ifdef CONFIG_MMU
>  	struct work_struct async_put_work;
> +#endif
>  };
>  
>  static inline void mm_init_cpumask(struct mm_struct *mm)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index df8778e72211..11b31ded65cf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2604,10 +2604,12 @@ static inline void mmdrop(struct mm_struct * mm)
>  
>  /* mmput gets rid of the mappings and all user-space */
>  extern void mmput(struct mm_struct *);
> +#ifdef CONFIG_MMU
>  /* same as above but performs the slow path from the async kontext. Can

                                                              c

>   * be called from the atomic context as well
>   */
>  extern void mmput_async(struct mm_struct *);
> +#endif
>  
>  /* Grab a reference to a task's mm, if it is not already going away */
>  extern struct mm_struct *get_task_mm(struct task_struct *task);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e1dc6b02ac8b..1e3dc3af6845 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -732,6 +732,7 @@ void mmput(struct mm_struct *mm)
>  }
>  EXPORT_SYMBOL_GPL(mmput);
>  
> +#ifdef CONFIG_MMU
>  static void mmput_async_fn(struct work_struct *work)
>  {
>  	struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
> @@ -745,6 +746,7 @@ void mmput_async(struct mm_struct *mm)
>  		schedule_work(&mm->async_put_work);
>  	}
>  }
> +#endif
>  
>  /**
>   * set_mm_exe_file - change a reference to the mm's executable file
> -- 
> 2.8.1
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-05-25 13:50         ` Michal Hocko
@ 2016-05-25 14:30           ` Tetsuo Handa
  0 siblings, 0 replies; 16+ messages in thread
From: Tetsuo Handa @ 2016-05-25 14:30 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, rientjes, linux-mm

Michal Hocko wrote:
> On Wed 25-05-16 19:52:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > Just a random thought, but after this patch is applied, do we still need to use
> > > > a dedicated kernel thread for OOM-reap operation? If I recall correctly, the
> > > > reason we decided to use a dedicated kernel thread was that calling
> > > > down_read(&mm->mmap_sem) / mmput() from the OOM killer context is unsafe due to
> > > > dependency. By replacing mmput() with mmput_async(), since __oom_reap_task() will
> > > > no longer do operations that might block, can't we try OOM-reap operation from
> > > > current thread which called mark_oom_victim() or oom_scan_process_thread() ?
> > > 
> > > I was already thinking about that. It is true that the main blocker
> > > was the mmput, as you say, but the dedicated kernel thread seems to be
> > > more robust locking and stack wise. So I would prefer staying with the
> > > current approach until we see that it is somehow limitting. One pid and
> > > kernel stack doesn't seem to be a terrible price to me. But as I've said
> > > I am not bound to the kernel thread approach...
> > > 
> > 
> > It seems to me that async OOM reaping widens race window for needlessly
> > selecting next OOM victim, for the OOM reaper holding a reference of a
> > TIF_MEMDIE thread's mm expedites clearing TIF_MEMDIE from that thread
> > by making atomic_dec_and_test() in mmput() from exit_mm() false.
>  
> AFAIU you mean
> __oom_reap_task			exit_mm
>   atomic_inc_not_zero
> 				  tsk->mm = NULL
> 				  mmput
>   				    atomic_dec_and_test # > 0
> 				  exit_oom_victim # New victim will be
> 				  		  # selected
> 				<OOM killer invoked>
> 				  # no TIF_MEMDIE task so we can select a new one
>   unmap_page_range # to release the memory
> 

Yes.

> Previously we were kind of protected by PF_EXITING check in
> oom_scan_process_thread which is not there anymore. The race is possible
> even without the oom reaper because many other call sites might pin
> the address space and be preempted for an unbounded amount of time. We

It is true that there has been a race window even without the OOM reaper
(and I tried to mitigate it using oomkiller_holdoff_timer).
But until the OOM reaper kernel thread was introduced, the sequence

 				  mmput
   				    atomic_dec_and_test # > 0
 				  exit_oom_victim # New victim will be
 				  		  # selected

was able to select another thread sharing that mm (with noisy dump_header()
messages, which I think should be suppressed once that thread group has received
SIGKILL from oom_kill_process()). Since the OOM reaper is a kernel thread,
this sequence will simply select a different thread group not sharing that mm.
In this regard, I think that async OOM reaping has increased the possibility of
needlessly selecting the next OOM victim.

> could widen the race window by reintroducing the check or moving
> exit_oom_victim later in do_exit after exit_notify which then removes
> the task from the task_list (in __unhash_process) so the OOM killer
> wouldn't see it anyway. Sounds ugly to me though.
> 
> > Maybe we should wait for first OOM reap attempt from the OOM killer context
> > before releasing oom_lock mutex (sync OOM reaping) ?
> 
> I do not think we want to wait inside the oom_lock as it is a global
> lock shared by all OOM killer contexts. Another option would be to use
> the oom_lock inside __oom_reap_task. It is not super cool either because
> now we have a dependency on the lock but looks like reasonably easy
> solution.

It would be nice if we could wait until the memory reclaimed from the OOM
victim's mm has been queued to the freelist for allocation. But I have no idea
other than oomkiller_holdoff_timer.

I think this problem should be discussed another day in a new thread.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-05-25 10:52       ` Tetsuo Handa
@ 2016-05-25 13:50         ` Michal Hocko
  2016-05-25 14:30           ` Tetsuo Handa
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2016-05-25 13:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, linux-mm

On Wed 25-05-16 19:52:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Just a random thought, but after this patch is applied, do we still need to use
> > > a dedicated kernel thread for OOM-reap operation? If I recall correctly, the
> > > reason we decided to use a dedicated kernel thread was that calling
> > > down_read(&mm->mmap_sem) / mmput() from the OOM killer context is unsafe due to
> > > dependency. By replacing mmput() with mmput_async(), since __oom_reap_task() will
> > > no longer do operations that might block, can't we try OOM-reap operation from
> > > current thread which called mark_oom_victim() or oom_scan_process_thread() ?
> > 
> > I was already thinking about that. It is true that the main blocker
> > was the mmput, as you say, but the dedicated kernel thread seems to be
> > more robust locking and stack wise. So I would prefer staying with the
> > current approach until we see that it is somehow limitting. One pid and
> > kernel stack doesn't seem to be a terrible price to me. But as I've said
> > I am not bound to the kernel thread approach...
> > 
> 
> It seems to me that async OOM reaping widens race window for needlessly
> selecting next OOM victim, for the OOM reaper holding a reference of a
> TIF_MEMDIE thread's mm expedites clearing TIF_MEMDIE from that thread
> by making atomic_dec_and_test() in mmput() from exit_mm() false.
 
AFAIU you mean
__oom_reap_task			exit_mm
  atomic_inc_not_zero
				  tsk->mm = NULL
				  mmput
  				    atomic_dec_and_test # > 0
				  exit_oom_victim # New victim will be
				  		  # selected
				<OOM killer invoked>
				  # no TIF_MEMDIE task so we can select a new one
  unmap_page_range # to release the memory

Previously we were kind of protected by the PF_EXITING check in
oom_scan_process_thread, which is not there anymore. The race is possible
even without the oom reaper, because many other call sites might pin
the address space and be preempted for an unbounded amount of time. We
could narrow the race window by reintroducing the check, or by moving
exit_oom_victim later in do_exit, after exit_notify, which removes
the task from the task list (in __unhash_process) so the OOM killer
wouldn't see it anyway. Sounds ugly to me though.

> Maybe we should wait for first OOM reap attempt from the OOM killer context
> before releasing oom_lock mutex (sync OOM reaping) ?

I do not think we want to wait inside the oom_lock, as it is a global
lock shared by all OOM killer contexts. Another option would be to take
the oom_lock inside __oom_reap_task. It is not super cool either, because
now we have a dependency on the lock, but it looks like a reasonably easy
solution.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..d0f42cc88f6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -450,6 +450,22 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	bool ret = true;
 
 	/*
+	 * We have to make sure to not race with the victim exit path
+	 * and cause premature new oom victim selection:
+	 * __oom_reap_task		exit_mm
+	 *   atomic_inc_not_zero
+	 *   				  mmput
+	 *   				    atomic_dec_and_test
+	 *				  exit_oom_victim
+	 *				[...]
+	 *				out_of_memory
+	 *				  select_bad_process
+	 *				    # no TIF_MEMDIE task select new victim
+	 *  unmap_page_range # frees some memory
+	 */
+	mutex_lock(&oom_lock);
+
+	/*
 	 * Make sure we find the associated mm_struct even when the particular
 	 * thread has already terminated and cleared its mm.
 	 * We might have race with exit path so consider our work done if there
@@ -457,19 +473,19 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	p = find_lock_task_mm(tsk);
 	if (!p)
-		return true;
+		goto unlock_oom;
 
 	mm = p->mm;
 	if (!atomic_inc_not_zero(&mm->mm_users)) {
 		task_unlock(p);
-		return true;
+		goto unlock_oom;
 	}
 
 	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto out;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -511,7 +527,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * to release its memory.
 	 */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
-out:
+unlock_oom:
+	mutex_unlock(&oom_lock);
 	/*
 	 * Drop our reference but make sure the mmput slow path is called from a
 	 * different context because we shouldn't risk we get stuck there and

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-05-19 17:20     ` Michal Hocko
@ 2016-05-25 10:52       ` Tetsuo Handa
  2016-05-25 13:50         ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2016-05-25 10:52 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, rientjes, linux-mm

Michal Hocko wrote:
> > Just a random thought, but after this patch is applied, do we still need to use
> > a dedicated kernel thread for OOM-reap operation? If I recall correctly, the
> > reason we decided to use a dedicated kernel thread was that calling
> > down_read(&mm->mmap_sem) / mmput() from the OOM killer context is unsafe due to
> > dependency. By replacing mmput() with mmput_async(), since __oom_reap_task() will
> > no longer do operations that might block, can't we try OOM-reap operation from
> > current thread which called mark_oom_victim() or oom_scan_process_thread() ?
> 
> I was already thinking about that. It is true that the main blocker
> was the mmput, as you say, but the dedicated kernel thread seems to be
> more robust locking and stack wise. So I would prefer staying with the
> current approach until we see that it is somehow limitting. One pid and
> kernel stack doesn't seem to be a terrible price to me. But as I've said
> I am not bound to the kernel thread approach...
> 

It seems to me that async OOM reaping widens the race window for needlessly
selecting the next OOM victim, because the OOM reaper holding a reference to a
TIF_MEMDIE thread's mm expedites clearing TIF_MEMDIE from that thread
by making the atomic_dec_and_test() in mmput() called from exit_mm() return false.

Maybe we should wait for the first OOM reap attempt from the OOM killer context
before releasing the oom_lock mutex (sync OOM reaping)?

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160525.txt.xz .
----------------------------------------
[   73.485228] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   73.487497] [  442]     0   442   123055   105971     215       4        0             0 oleg's-test
[   73.490055] Out of memory: Kill process 442 (oleg's-test) score 855 or sacrifice child
[   73.492178] Killed process 442 (oleg's-test) total-vm:492220kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
[   73.516065] oom_reaper: reaped process 442 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   74.308526] oleg's-test invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[   74.316047] oleg's-test cpuset=/ mems_allowed=0
[   74.320387] CPU: 3 PID: 443 Comm: oleg's-test Not tainted 4.6.0-rc7+ #51
[   74.325435] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   74.329768]  0000000000000286 00000000e7952d6d ffff88001943fad0 ffffffff812c7cbd
[   74.333314]  0000000000000000 ffff88001943fcf0 ffff88001943fb70 ffffffff811b9e94
[   74.336859]  0000000000000206 ffffffff8182b970 ffff88001943fb10 ffffffff810bc519
[   74.340524] Call Trace:
[   74.342373]  [<ffffffff812c7cbd>] dump_stack+0x85/0xc8
[   74.345138]  [<ffffffff811b9e94>] dump_header+0x5b/0x394
[   74.347878]  [<ffffffff810bc519>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[   74.350988]  [<ffffffff810bc5ed>] ? trace_hardirqs_on+0xd/0x10
[   74.353897]  [<ffffffff8114434f>] oom_kill_process+0x35f/0x540
[   74.356908]  [<ffffffff81144786>] out_of_memory+0x206/0x5a0
[   74.359694]  [<ffffffff81144844>] ? out_of_memory+0x2c4/0x5a0
[   74.362553]  [<ffffffff8114a55a>] __alloc_pages_nodemask+0xb3a/0xb50
[   74.365611]  [<ffffffff810bcd36>] ? __lock_acquire+0x3d6/0x1fb0
[   74.368513]  [<ffffffff81194fe6>] alloc_pages_vma+0xb6/0x290
[   74.371293]  [<ffffffff81172673>] handle_mm_fault+0x1873/0x1e60
[   74.374181]  [<ffffffff81170e4c>] ? handle_mm_fault+0x4c/0x1e60
[   74.376989]  [<ffffffff8105c01a>] ? __do_page_fault+0x1da/0x4d0
[   74.379758]  [<ffffffff8105bf67>] ? __do_page_fault+0x127/0x4d0
[   74.382502]  [<ffffffff8105bff5>] __do_page_fault+0x1b5/0x4d0
[   74.385086]  [<ffffffff8105c340>] do_page_fault+0x30/0x80
[   74.386771]  [<ffffffff8160fbe8>] page_fault+0x28/0x30
[   74.388469] Mem-Info:
[   74.389457] active_anon:106132 inactive_anon:1024 isolated_anon:0
[   74.389457]  active_file:1 inactive_file:20 isolated_file:0
[   74.389457]  unevictable:0 dirty:0 writeback:0 unstable:0
[   74.389457]  slab_reclaimable:368 slab_unreclaimable:3722
[   74.389457]  mapped:3 shmem:1026 pagetables:224 bounce:0
[   74.389457]  free:1085 free_pcp:0 free_cma:0
[   74.398435] Node 0 DMA free:1816kB min:92kB low:112kB high:132kB active_anon:12208kB inactive_anon:124kB active_file:8kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:128kB slab_reclaimable:44kB slab_unreclaimable:624kB kernel_stack:112kB pagetables:20kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? yes
[   74.408909] lowmem_reserve[]: 0 432 432 432
[   74.410422] Node 0 DMA32 free:2524kB min:2612kB low:3264kB high:3916kB active_anon:412320kB inactive_anon:3972kB active_file:0kB inactive_file:76kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:507776kB managed:465568kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:3976kB slab_reclaimable:1428kB slab_unreclaimable:14264kB kernel_stack:2704kB pagetables:876kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[   74.421596] lowmem_reserve[]: 0 0 0 0
[   74.423273] Node 0 DMA: 11*4kB (UM) 8*8kB (U) 5*16kB (U) 3*32kB (UM) 4*64kB (U) 4*128kB (UM) 3*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1820kB
[   74.427306] Node 0 DMA32: 95*4kB (UME) 117*8kB (UME) 44*16kB (UME) 16*32kB (UM) 0*64kB 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2660kB
[   74.431255] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   74.433655] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   74.436036] 1056 total pagecache pages
[   74.437448] 0 pages in swap cache
[   74.438935] Swap cache stats: add 0, delete 0, find 0/0
[   74.440573] Free swap  = 0kB
[   74.441830] Total swap = 0kB
[   74.443003] 130941 pages RAM
[   74.444220] 0 pages HighMem/MovableOnly
[   74.445603] 10573 pages reserved
[   74.446873] 0 pages cma reserved
[   74.448068] 0 pages hwpoisoned
[   74.449255] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   74.451523] [  443]     0   443   123312   105971     214       3        0             0 oleg's-test
[   74.453958] Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
[   74.456234] Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
[   74.459219] sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[   74.462813] sh cpuset=/ mems_allowed=0
[   74.465221] CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
[   74.467037] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   74.470207]  0000000000000286 00000000a17a86b0 ffff88001e673a18 ffffffff812c7cbd
[   74.473422]  0000000000000000 ffff88001e673bd0 ffff88001e673ab8 ffffffff811b9e94
[   74.475704]  ffff88001e66cbe0 ffff88001e673ab8 0000000000000246 0000000000000000
[   74.477990] Call Trace:
[   74.479170]  [<ffffffff812c7cbd>] dump_stack+0x85/0xc8
[   74.480872]  [<ffffffff811b9e94>] dump_header+0x5b/0x394
[   74.481837] oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   74.485142]  [<ffffffff81144ac6>] out_of_memory+0x546/0x5a0
[   74.486959]  [<ffffffff81144844>] ? out_of_memory+0x2c4/0x5a0
[   74.489013]  [<ffffffff8114a55a>] __alloc_pages_nodemask+0xb3a/0xb50
[   74.491229]  [<ffffffff8113f500>] ? find_get_entry+0x40/0x1d0
[   74.493033]  [<ffffffff81193316>] alloc_pages_current+0x96/0x1b0
[   74.494909]  [<ffffffff8113edcd>] __page_cache_alloc+0x12d/0x160
[   74.496705]  [<ffffffff81142915>] filemap_fault+0x455/0x670
[   74.498470]  [<ffffffff811427f0>] ? filemap_fault+0x330/0x670
[   74.500279]  [<ffffffffa0231c99>] xfs_filemap_fault+0x39/0x60 [xfs]
[   74.502149]  [<ffffffff8116bc0b>] __do_fault+0x6b/0x120
[   74.503811]  [<ffffffff81171fd2>] handle_mm_fault+0x11d2/0x1e60
[   74.505702]  [<ffffffff81170e4c>] ? handle_mm_fault+0x4c/0x1e60
[   74.507450]  [<ffffffff810bc519>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[   74.509336]  [<ffffffff8105bf67>] ? __do_page_fault+0x127/0x4d0
[   74.511054]  [<ffffffff8105bff5>] __do_page_fault+0x1b5/0x4d0
[   74.512693]  [<ffffffff8105c340>] do_page_fault+0x30/0x80
[   74.514303]  [<ffffffff8160fbe8>] page_fault+0x28/0x30
[   74.515893] Mem-Info:
[   74.516865] active_anon:106 inactive_anon:1024 isolated_anon:0
[   74.516865]  active_file:1 inactive_file:20 isolated_file:0
[   74.516865]  unevictable:0 dirty:0 writeback:0 unstable:0
[   74.516865]  slab_reclaimable:368 slab_unreclaimable:3722
[   74.516865]  mapped:3 shmem:1026 pagetables:24 bounce:0
[   74.516865]  free:107155 free_pcp:191 free_cma:0
[   74.526066] Node 0 DMA free:14024kB min:92kB low:112kB high:132kB active_anon:24kB inactive_anon:124kB active_file:8kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:128kB slab_reclaimable:44kB slab_unreclaimable:624kB kernel_stack:112kB pagetables:4kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[   74.536441] lowmem_reserve[]: 0 432 432 432
[   74.538048] Node 0 DMA32 free:414596kB min:2612kB low:3264kB high:3916kB active_anon:400kB inactive_anon:3972kB active_file:0kB inactive_file:76kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:507776kB managed:465568kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:3976kB slab_reclaimable:1428kB slab_unreclaimable:14264kB kernel_stack:2704kB pagetables:92kB unstable:0kB bounce:0kB free_pcp:764kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[   74.549582] lowmem_reserve[]: 0 0 0 0
[   74.551165] Node 0 DMA: 23*4kB (UM) 20*8kB (UM) 13*16kB (UM) 6*32kB (UM) 5*64kB (UM) 6*128kB (UM) 2*256kB (U) 3*512kB (UM) 2*1024kB (M) 2*2048kB (M) 1*4096kB (M) = 14028kB
[   74.556385] Node 0 DMA32: 522*4kB (UME) 342*8kB (UME) 198*16kB (UME) 116*32kB (UM) 91*64kB (UM) 29*128kB (M) 23*256kB (M) 17*512kB (M) 16*1024kB (M) 9*2048kB (M) 84*4096kB (M) = 414712kB
[   74.561989] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   74.564360] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   74.566681] 1056 total pagecache pages
[   74.568112] 0 pages in swap cache
[   74.569439] Swap cache stats: add 0, delete 0, find 0/0
[   74.571166] Free swap  = 0kB
[   74.572455] Total swap = 0kB
[   74.573683] 130941 pages RAM
[   74.574960] 0 pages HighMem/MovableOnly
[   74.576336] 10573 pages reserved
[   74.577610] 0 pages cma reserved
[   74.578924] 0 pages hwpoisoned
[   74.580135] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   74.582454] Kernel panic - not syncing: Out of memory and no killable processes...
[   74.582454]
[   74.585646] Kernel Offset: disabled
[   74.587538] ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
----------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-05-19 14:29   ` Tetsuo Handa
@ 2016-05-19 17:20     ` Michal Hocko
  2016-05-25 10:52       ` Tetsuo Handa
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2016-05-19 17:20 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, rientjes, linux-mm

On Thu 19-05-16 23:29:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Tetsuo has properly noted that the mmput slow path might get blocked
> > waiting for another party (e.g. exit_aio waits for an IO). If that
> > happens the oom_reaper would be put out of the way and would not be able
> > to process the next oom victim. We should strive to make this context as
> > reliable and as independent of other subsystems as possible.
> > 
> > Introduce mmput_async, which will perform the slow path from an async
> > (WQ) context. This will delay the operation, but that shouldn't be a
> > problem because in most cases the oom_reaper has already reclaimed as
> > much of the victim's address space as possible, and the remaining
> > context shouldn't bind too much memory anymore. The only exception is
> > when the mmap_sem trylock has failed, which shouldn't happen too often.
> > 
> > The issue is only theoretical but not impossible.
> 
> Just a random thought, but after this patch is applied, do we still need to
> use a dedicated kernel thread for the OOM-reap operation? If I recall
> correctly, the reason we decided to use a dedicated kernel thread was that
> calling down_read(&mm->mmap_sem) / mmput() from the OOM killer context is
> unsafe due to dependency. By replacing mmput() with mmput_async(), since
> __oom_reap_task() will no longer do operations that might block, can't we
> try the OOM-reap operation from the current thread which called
> mark_oom_victim() or oom_scan_process_thread()?

I was already thinking about that. It is true that the main blocker
was the mmput, as you say, but the dedicated kernel thread seems to be
more robust locking- and stack-wise. So I would prefer staying with the
current approach until we see that it is somehow limiting. One pid and
one kernel stack don't seem to be a terrible price to me. But as I've
said, I am not bound to the kernel thread approach...

> I want to start waking up the OOM reaper whenever TIF_MEMDIE is set or found.
> 
> Is using a dedicated kernel thread still better because the memory
> allocation path has already consumed a lot of kernel stack? But even then
> we wouldn't need to give up OOM-reaping when kthread_run() fails.

Is kthread_run failure during early boot even an option? Isn't such a
system screwed up by definition?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-04-26 14:04   ` Michal Hocko
  (?)
  (?)
@ 2016-05-19 14:29   ` Tetsuo Handa
  2016-05-19 17:20     ` Michal Hocko
  -1 siblings, 1 reply; 16+ messages in thread
From: Tetsuo Handa @ 2016-05-19 14:29 UTC (permalink / raw)
  To: mhocko; +Cc: akpm, rientjes, linux-mm, mhocko

Michal Hocko wrote:
> Tetsuo has properly noted that the mmput slow path might get blocked
> waiting for another party (e.g. exit_aio waits for an IO). If that
> happens the oom_reaper would be put out of the way and would not be able
> to process the next oom victim. We should strive to make this context as
> reliable and as independent of other subsystems as possible.
> 
> Introduce mmput_async, which will perform the slow path from an async
> (WQ) context. This will delay the operation, but that shouldn't be a
> problem because in most cases the oom_reaper has already reclaimed as
> much of the victim's address space as possible, and the remaining
> context shouldn't bind too much memory anymore. The only exception is
> when the mmap_sem trylock has failed, which shouldn't happen too often.
> 
> The issue is only theoretical but not impossible.

Just a random thought, but after this patch is applied, do we still need to
use a dedicated kernel thread for the OOM-reap operation? If I recall
correctly, the reason we decided to use a dedicated kernel thread was that
calling down_read(&mm->mmap_sem) / mmput() from the OOM killer context is
unsafe due to dependency. By replacing mmput() with mmput_async(), since
__oom_reap_task() will no longer do operations that might block, can't we
try the OOM-reap operation from the current thread which called
mark_oom_victim() or oom_scan_process_thread()?
I want to start waking up the OOM reaper whenever TIF_MEMDIE is set or found.

Is using a dedicated kernel thread still better because the memory
allocation path has already consumed a lot of kernel stack? But even then
we wouldn't need to give up OOM-reaping when kthread_run() fails.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-04-26 14:18   ` kbuild test robot
@ 2016-04-26 14:58       ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2016-04-26 14:58 UTC (permalink / raw)
  To: kbuild test robot, Andrew Morton
  Cc: kbuild-all, Tetsuo Handa, David Rientjes, linux-mm, LKML

On Tue 26-04-16 22:18:22, kbuild test robot wrote:
> >> include/linux/mm_types.h:516:21: error: field 'async_put_work' has incomplete type
>      struct work_struct async_put_work;

My bad. We need to include <linux/workqueue.h> explicitly because we
relied on it only indirectly, which happened to work fine for most of my
configs - not so for allnoconfig, though. Please fold this into the
original patch, or let me know and I will repost the full patch.
---
From 368e90e7640a1eb5f0e621c7ccb08bc7ef2d272b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 26 Apr 2016 16:48:13 +0200
Subject: [PATCH] mm-fix: mm, oom_reaper: do not mmput synchronously from the
 oom reaper context

In file included from include/linux/sched.h:27:0,
	    from include/linux/oom.h:5,
	    from mm/oom_kill.c:20:
>> include/linux/mm_types.h:516:21: error: field 'async_put_work' has incomplete type
	struct work_struct async_put_work;

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm_types.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f4f477244679..ab142ace96f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/cpumask.h>
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
+#include <linux/workqueue.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
-- 
2.8.0.rc3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
@ 2016-04-26 14:58       ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2016-04-26 14:58 UTC (permalink / raw)
  To: kbuild test robot, Andrew Morton
  Cc: kbuild-all, Tetsuo Handa, David Rientjes, linux-mm, LKML

On Tue 26-04-16 22:18:22, kbuild test robot wrote:
> >> include/linux/mm_types.h:516:21: error: field 'async_put_work' has incomplete type
>      struct work_struct async_put_work;

My bad. We need to include <linux/workqueue.h> explicitly because we
relied on it only indirectly, which happened to work fine for most of my
configs - not so for allnoconfig, though. Please fold this into the
original patch, or let me know and I will repost the full patch.
---

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-04-26 14:04   ` Michal Hocko
  (?)
@ 2016-04-26 14:18   ` kbuild test robot
  2016-04-26 14:58       ` Michal Hocko
  -1 siblings, 1 reply; 16+ messages in thread
From: kbuild test robot @ 2016-04-26 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild-all, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1376 bytes --]

Hi,

[auto build test ERROR on next-20160426]
[cannot apply to tip/sched/core v4.6-rc5 v4.6-rc4 v4.6-rc3 v4.6-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-oom_reaper-hide-oom-reaped-tasks-from-OOM-killer-more-carefully/20160426-220841
config: x86_64-randconfig-x010-201617 (attached as .config)
compiler: 
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from include/linux/sched.h:27:0,
                    from include/linux/oom.h:5,
                    from mm/oom_kill.c:20:
>> include/linux/mm_types.h:516:21: error: field 'async_put_work' has incomplete type
     struct work_struct async_put_work;
                        ^

vim +/async_put_work +516 include/linux/mm_types.h

   510		/* address of the bounds directory */
   511		void __user *bd_addr;
   512	#endif
   513	#ifdef CONFIG_HUGETLB_PAGE
   514		atomic_long_t hugetlb_usage;
   515	#endif
 > 516		struct work_struct async_put_work;
   517	};
   518	
   519	static inline void mm_init_cpumask(struct mm_struct *mm)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 33900 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
  2016-04-26 14:04 [PATCH 0/2] last pile of oom_reaper patches for now Michal Hocko
@ 2016-04-26 14:04   ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2016-04-26 14:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tetsuo Handa, David Rientjes, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo has properly noted that the mmput slow path might get blocked
waiting for another party (e.g. exit_aio waits for an IO). If that
happens the oom_reaper would be put out of the way and would not be able
to process the next oom victim. We should strive to make this context as
reliable and as independent of other subsystems as possible.

Introduce mmput_async, which will perform the slow path from an async
(WQ) context. This will delay the operation, but that shouldn't be a
problem because in most cases the oom_reaper has already reclaimed as
much of the victim's address space as possible, and the remaining
context shouldn't bind too much memory anymore. The only exception is
when the mmap_sem trylock has failed, which shouldn't happen too often.

The issue is only theoretical but not impossible.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm_types.h |  1 +
 include/linux/sched.h    |  5 +++++
 kernel/fork.c            | 50 +++++++++++++++++++++++++++++++++---------------
 mm/oom_kill.c            |  8 ++++++--
 4 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf4f3ac5110c..f4f477244679 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -513,6 +513,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+	struct work_struct async_put_work;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7bd0fa9db199..df8778e72211 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2604,6 +2604,11 @@ static inline void mmdrop(struct mm_struct * mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
+/* same as above but performs the slow path from the async context. Can
+ * be called from the atomic context as well
+ */
+extern void mmput_async(struct mm_struct *);
+
 /* Grab a reference to a task's mm, if it is not already going away */
 extern struct mm_struct *get_task_mm(struct task_struct *task);
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index accb7221d547..10b0f771d795 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -696,6 +696,26 @@ void __mmdrop(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
 
+static inline void __mmput(struct mm_struct *mm)
+{
+	VM_BUG_ON(atomic_read(&mm->mm_users));
+
+	uprobe_clear_state(mm);
+	exit_aio(mm);
+	ksm_exit(mm);
+	khugepaged_exit(mm); /* must run before exit_mmap */
+	exit_mmap(mm);
+	set_mm_exe_file(mm, NULL);
+	if (!list_empty(&mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		list_del(&mm->mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+	if (mm->binfmt)
+		module_put(mm->binfmt->module);
+	mmdrop(mm);
+}
+
 /*
  * Decrement the use count and release all resources for an mm.
  */
@@ -703,24 +723,24 @@ void mmput(struct mm_struct *mm)
 {
 	might_sleep();
 
+	if (atomic_dec_and_test(&mm->mm_users))
+		__mmput(mm);
+}
+EXPORT_SYMBOL_GPL(mmput);
+
+static void mmput_async_fn(struct work_struct *work)
+{
+	struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
+	__mmput(mm);
+}
+
+void mmput_async(struct mm_struct *mm)
+{
 	if (atomic_dec_and_test(&mm->mm_users)) {
-		uprobe_clear_state(mm);
-		exit_aio(mm);
-		ksm_exit(mm);
-		khugepaged_exit(mm); /* must run before exit_mmap */
-		exit_mmap(mm);
-		set_mm_exe_file(mm, NULL);
-		if (!list_empty(&mm->mmlist)) {
-			spin_lock(&mmlist_lock);
-			list_del(&mm->mmlist);
-			spin_unlock(&mmlist_lock);
-		}
-		if (mm->binfmt)
-			module_put(mm->binfmt->module);
-		mmdrop(mm);
+		INIT_WORK(&mm->async_put_work, mmput_async_fn);
+		schedule_work(&mm->async_put_work);
 	}
 }
-EXPORT_SYMBOL_GPL(mmput);
 
 /**
  * set_mm_exe_file - change a reference to the mm's executable file
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c0376efa79ec..c0e37dd1422f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -446,7 +446,6 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -520,7 +519,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
 out:
-	mmput(mm);
+	/*
+	 * Drop our reference but make sure the mmput slow path is called from a
+	 * different context because we shouldn't risk we get stuck there and
+	 * put the oom_reaper out of the way.
+	 */
+	mmput_async(mm);
 	return ret;
 }
 
-- 
2.8.0.rc3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context
@ 2016-04-26 14:04   ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2016-04-26 14:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tetsuo Handa, David Rientjes, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo has properly noted that the mmput slow path might get blocked
waiting for another party (e.g. exit_aio waits for an IO). If that
happens the oom_reaper would be put out of the way and would not be able
to process the next oom victim. We should strive to make this context as
reliable and as independent of other subsystems as possible.

Introduce mmput_async, which will perform the slow path from an async
(WQ) context. This will delay the operation, but that shouldn't be a
problem because in most cases the oom_reaper has already reclaimed as
much of the victim's address space as possible, and the remaining
context shouldn't bind too much memory anymore. The only exception is
when the mmap_sem trylock has failed, which shouldn't happen too often.

The issue is only theoretical but not impossible.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm_types.h |  1 +
 include/linux/sched.h    |  5 +++++
 kernel/fork.c            | 50 +++++++++++++++++++++++++++++++++---------------
 mm/oom_kill.c            |  8 ++++++--
 4 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf4f3ac5110c..f4f477244679 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -513,6 +513,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+	struct work_struct async_put_work;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7bd0fa9db199..df8778e72211 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2604,6 +2604,11 @@ static inline void mmdrop(struct mm_struct * mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
+/* same as above but performs the slow path from the async context. Can
+ * be called from the atomic context as well
+ */
+extern void mmput_async(struct mm_struct *);
+
 /* Grab a reference to a task's mm, if it is not already going away */
 extern struct mm_struct *get_task_mm(struct task_struct *task);
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index accb7221d547..10b0f771d795 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -696,6 +696,26 @@ void __mmdrop(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
 
+static inline void __mmput(struct mm_struct *mm)
+{
+	VM_BUG_ON(atomic_read(&mm->mm_users));
+
+	uprobe_clear_state(mm);
+	exit_aio(mm);
+	ksm_exit(mm);
+	khugepaged_exit(mm); /* must run before exit_mmap */
+	exit_mmap(mm);
+	set_mm_exe_file(mm, NULL);
+	if (!list_empty(&mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		list_del(&mm->mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+	if (mm->binfmt)
+		module_put(mm->binfmt->module);
+	mmdrop(mm);
+}
+
 /*
  * Decrement the use count and release all resources for an mm.
  */
@@ -703,24 +723,24 @@ void mmput(struct mm_struct *mm)
 {
 	might_sleep();
 
+	if (atomic_dec_and_test(&mm->mm_users))
+		__mmput(mm);
+}
+EXPORT_SYMBOL_GPL(mmput);
+
+static void mmput_async_fn(struct work_struct *work)
+{
+	struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);
+	__mmput(mm);
+}
+
+void mmput_async(struct mm_struct *mm)
+{
 	if (atomic_dec_and_test(&mm->mm_users)) {
-		uprobe_clear_state(mm);
-		exit_aio(mm);
-		ksm_exit(mm);
-		khugepaged_exit(mm); /* must run before exit_mmap */
-		exit_mmap(mm);
-		set_mm_exe_file(mm, NULL);
-		if (!list_empty(&mm->mmlist)) {
-			spin_lock(&mmlist_lock);
-			list_del(&mm->mmlist);
-			spin_unlock(&mmlist_lock);
-		}
-		if (mm->binfmt)
-			module_put(mm->binfmt->module);
-		mmdrop(mm);
+		INIT_WORK(&mm->async_put_work, mmput_async_fn);
+		schedule_work(&mm->async_put_work);
 	}
 }
-EXPORT_SYMBOL_GPL(mmput);
 
 /**
  * set_mm_exe_file - change a reference to the mm's executable file
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c0376efa79ec..c0e37dd1422f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -446,7 +446,6 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -520,7 +519,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
 out:
-	mmput(mm);
+	/*
+	 * Drop our reference but make sure the mmput slow path is called from a
+	 * different context because we shouldn't risk we get stuck there and
+	 * put the oom_reaper out of the way.
+	 */
+	mmput_async(mm);
 	return ret;
 }
 
-- 
2.8.0.rc3


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-05-25 14:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-20  1:30 [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context Minchan Kim
2016-05-20  1:30 ` Minchan Kim
2016-05-20  6:16 ` Michal Hocko
2016-05-20  6:16   ` Michal Hocko
2016-05-20  7:12   ` Minchan Kim
2016-05-20  7:12     ` Minchan Kim
  -- strict thread matches above, loose matches on Subject: below --
2016-04-26 14:04 [PATCH 0/2] last pile of oom_reaper patches for now Michal Hocko
2016-04-26 14:04 ` [PATCH 2/2] mm, oom_reaper: do not mmput synchronously from the oom reaper context Michal Hocko
2016-04-26 14:04   ` Michal Hocko
2016-04-26 14:18   ` kbuild test robot
2016-04-26 14:58     ` Michal Hocko
2016-04-26 14:58       ` Michal Hocko
2016-05-19 14:29   ` Tetsuo Handa
2016-05-19 17:20     ` Michal Hocko
2016-05-25 10:52       ` Tetsuo Handa
2016-05-25 13:50         ` Michal Hocko
2016-05-25 14:30           ` Tetsuo Handa
