linux-mm.kvack.org archive mirror
* [patch] mm, memcg: add memory.oom_control notification for system oom
@ 2013-10-31  1:39 David Rientjes
  2013-10-31  5:49 ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-10-31  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

A subset of applications that wait on memory.oom_control don't disable
the oom killer for that memcg and simply log or clean up after the kernel
oom killer kills a process to free memory.

We need the ability to do this for system oom conditions as well, i.e.
when the system is depleted of all memory and must kill a process.  For
convenience, this can use memcg since oom notifiers are already present.

When a userspace process waits on the root memcg's memory.oom_control, it
will wake up anytime there is a system oom condition so that it can log
the event, including what process was killed and the stack, or clean up
after the kernel oom killer has killed something.

This is a special case of oom notifiers since it doesn't subsequently
notify all memcgs under the root memcg (all memcgs on the system).  We
don't want to trigger those oom handlers which are set aside specifically
for true memcg oom notifications that disable their own oom killers to
enforce their own oom policy, for example.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroups/memory.txt | 11 ++++++-----
 include/linux/memcontrol.h       |  5 +++++
 mm/memcontrol.c                  |  9 +++++++++
 mm/oom_kill.c                    |  4 ++++
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -739,18 +739,19 @@ delivery and gets notification when OOM happens.
 
 To register a notifier, an application must:
  - create an eventfd using eventfd(2)
- - open memory.oom_control file
+ - open memory.oom_control file for reading
  - write string like "<event_fd> <fd of memory.oom_control>" to
    cgroup.event_control
 
-The application will be notified through eventfd when OOM happens.
-OOM notification doesn't work for the root cgroup.
+The application will be notified through eventfd when OOM happens, including
+on system oom when used with the root memcg.
 
 You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
 
-	#echo 1 > memory.oom_control
+	# echo 1 > memory.oom_control
 
-This operation is only allowed to the top cgroup of a sub-hierarchy.
+This operation is only allowed to the top cgroup of a sub-hierarchy and does
+not include the root memcg.
 If OOM-killer is disabled, tasks under cgroup will hang/sleep
 in memory cgroup's OOM-waitqueue when they request accountable memory.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -155,6 +155,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 }
 
 bool mem_cgroup_oom_synchronize(bool wait);
+void mem_cgroup_root_oom_notify(void);
 
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
@@ -397,6 +398,10 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
 	return false;
 }
 
+static inline void mem_cgroup_root_oom_notify(void)
+{
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_stat_index idx)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5641,6 +5641,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
 		mem_cgroup_oom_notify_cb(iter);
 }
 
+/*
+ * Notify any process waiting on the root memcg's memory.oom_control, but do not
+ * notify any child memcgs to avoid triggering their per-memcg oom handlers.
+ */
+void mem_cgroup_root_oom_notify(void)
+{
+	mem_cgroup_oom_notify_cb(root_mem_cgroup);
+}
+
 static int mem_cgroup_usage_register_event(struct cgroup_subsys_state *css,
 	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
 {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -632,6 +632,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;
 	}
 
+	/* Avoid waking up processes for oom kills triggered by sysrq */
+	if (!force_kill)
+		mem_cgroup_root_oom_notify();
+
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
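
The registration steps in the memory.txt hunk above can be sketched from
userspace roughly as follows.  This is a minimal sketch, not part of the
patch: register_oom_eventfd() and format_oom_registration() are
hypothetical helper names, and cgroup_dir is assumed to be wherever the
(root) memcg is mounted on the system in question.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Build the "<event_fd> <fd of memory.oom_control>" registration string. */
static int format_oom_registration(char *buf, size_t len, int efd, int ofd)
{
	int n = snprintf(buf, len, "%d %d", efd, ofd);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: register an eventfd for oom notification on the
 * memcg mounted at cgroup_dir.  Returns the eventfd to wait on, or -1 on
 * failure.  The memory.oom_control fd is handed back through
 * *oom_control_fd and conservatively left open for the caller to close.
 */
static int register_oom_eventfd(const char *cgroup_dir, int *oom_control_fd)
{
	char path[4096], line[64];
	int efd, ofd = -1, cfd = -1, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	ofd = open(path, O_RDONLY);	/* opened for reading, per the doc */
	if (ofd < 0)
		goto fail;

	n = snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	n = format_oom_registration(line, sizeof(line), efd, ofd);
	if (n < 0 || write(cfd, line, n) != n)
		goto fail;

	close(cfd);
	*oom_control_fd = ofd;
	return efd;

fail:
	if (cfd >= 0)
		close(cfd);
	if (ofd >= 0)
		close(ofd);
	close(efd);
	return -1;
}
```

A caller would then block in read(2) on the returned eventfd; with this
patch applied, passing the root memcg's directory would make that read
return on a system oom kill.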

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-10-31  1:39 [patch] mm, memcg: add memory.oom_control notification for system oom David Rientjes
@ 2013-10-31  5:49 ` Johannes Weiner
  2013-11-13 22:19   ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-10-31  5:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, Oct 30, 2013 at 06:39:16PM -0700, David Rientjes wrote:
> A subset of applications that wait on memory.oom_control don't disable
> the oom killer for that memcg and simply log or clean up after the kernel
> oom killer kills a process to free memory.
> 
> We need the ability to do this for system oom conditions as well, i.e.
> when the system is depleted of all memory and must kill a process.  For
> convenience, this can use memcg since oom notifiers are already present.
> 
> When a userspace process waits on the root memcg's memory.oom_control, it
> will wake up anytime there is a system oom condition so that it can log
> the event, including what process was killed and the stack, or clean up
> after the kernel oom killer has killed something.
> 
> This is a special case of oom notifiers since it doesn't subsequently
> notify all memcgs under the root memcg (all memcgs on the system).  We
> don't want to trigger those oom handlers which are set aside specifically
> for true memcg oom notifications that disable their own oom killers to
> enforce their own oom policy, for example.

There is nothing they can do anyway since the handler is hardcoded for
the root cgroup, so this seems fine.

> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -155,6 +155,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
>  }
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
> +void mem_cgroup_root_oom_notify(void);
>  
>  #ifdef CONFIG_MEMCG_SWAP
>  extern int do_swap_account;
> @@ -397,6 +398,10 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
>  	return false;
>  }
>  
> +static inline void mem_cgroup_root_oom_notify(void)
> +{
> +}
> +
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
>  					    enum mem_cgroup_stat_index idx)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5641,6 +5641,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
>  		mem_cgroup_oom_notify_cb(iter);
>  }
>  
> +/*
> + * Notify any process waiting on the root memcg's memory.oom_control, but do not
> + * notify any child memcgs to avoid triggering their per-memcg oom handlers.
> + */
> +void mem_cgroup_root_oom_notify(void)
> +{
> +	mem_cgroup_oom_notify_cb(root_mem_cgroup);
> +}
> +
>  static int mem_cgroup_usage_register_event(struct cgroup_subsys_state *css,
>  	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
>  {
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -632,6 +632,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		return;
>  	}
>  
> +	/* Avoid waking up processes for oom kills triggered by sysrq */
> +	if (!force_kill)
> +		mem_cgroup_root_oom_notify();

We have an API for global OOM notifications, please just use
register_oom_notifier() instead.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-10-31  5:49 ` Johannes Weiner
@ 2013-11-13 22:19   ` David Rientjes
  2013-11-13 23:34     ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-13 22:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu, 31 Oct 2013, Johannes Weiner wrote:

> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -155,6 +155,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
> >  }
> >  
> >  bool mem_cgroup_oom_synchronize(bool wait);
> > +void mem_cgroup_root_oom_notify(void);
> >  
> >  #ifdef CONFIG_MEMCG_SWAP
> >  extern int do_swap_account;
> > @@ -397,6 +398,10 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
> >  	return false;
> >  }
> >  
> > +static inline void mem_cgroup_root_oom_notify(void)
> > +{
> > +}
> > +
> >  static inline void mem_cgroup_inc_page_stat(struct page *page,
> >  					    enum mem_cgroup_stat_index idx)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5641,6 +5641,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
> >  		mem_cgroup_oom_notify_cb(iter);
> >  }
> >  
> > +/*
> > + * Notify any process waiting on the root memcg's memory.oom_control, but do not
> > + * notify any child memcgs to avoid triggering their per-memcg oom handlers.
> > + */
> > +void mem_cgroup_root_oom_notify(void)
> > +{
> > +	mem_cgroup_oom_notify_cb(root_mem_cgroup);
> > +}
> > +
> >  static int mem_cgroup_usage_register_event(struct cgroup_subsys_state *css,
> >  	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
> >  {
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -632,6 +632,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  		return;
> >  	}
> >  
> > +	/* Avoid waking up processes for oom kills triggered by sysrq */
> > +	if (!force_kill)
> > +		mem_cgroup_root_oom_notify();
> 
> We have an API for global OOM notifications, please just use
> register_oom_notifier() instead.
> 

We can't use register_oom_notifier() because we don't want to notify the 
root memcg for a system oom handler if existing oom notifiers free memory 
(powerpc or s390).  We also don't want to notify the root memcg when 
current is exiting or has a pending SIGKILL, we just want to silently give 
it access to memory reserves and exit.  The mem_cgroup_root_oom_notify() 
here is placed correctly.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-11-13 22:19   ` David Rientjes
@ 2013-11-13 23:34     ` Johannes Weiner
  2013-11-14  0:56       ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-11-13 23:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, Nov 13, 2013 at 02:19:00PM -0800, David Rientjes wrote:
> On Thu, 31 Oct 2013, Johannes Weiner wrote:
> 
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -155,6 +155,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
> > >  }
> > >  
> > >  bool mem_cgroup_oom_synchronize(bool wait);
> > > +void mem_cgroup_root_oom_notify(void);
> > >  
> > >  #ifdef CONFIG_MEMCG_SWAP
> > >  extern int do_swap_account;
> > > @@ -397,6 +398,10 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
> > >  	return false;
> > >  }
> > >  
> > > +static inline void mem_cgroup_root_oom_notify(void)
> > > +{
> > > +}
> > > +
> > >  static inline void mem_cgroup_inc_page_stat(struct page *page,
> > >  					    enum mem_cgroup_stat_index idx)
> > >  {
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -5641,6 +5641,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
> > >  		mem_cgroup_oom_notify_cb(iter);
> > >  }
> > >  
> > > +/*
> > > + * Notify any process waiting on the root memcg's memory.oom_control, but do not
> > > + * notify any child memcgs to avoid triggering their per-memcg oom handlers.
> > > + */
> > > +void mem_cgroup_root_oom_notify(void)
> > > +{
> > > +	mem_cgroup_oom_notify_cb(root_mem_cgroup);
> > > +}
> > > +
> > >  static int mem_cgroup_usage_register_event(struct cgroup_subsys_state *css,
> > >  	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
> > >  {
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -632,6 +632,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > >  		return;
> > >  	}
> > >  
> > > +	/* Avoid waking up processes for oom kills triggered by sysrq */
> > > +	if (!force_kill)
> > > +		mem_cgroup_root_oom_notify();
> > 
> > We have an API for global OOM notifications, please just use
> > register_oom_notifier() instead.
> > 
> 
> We can't use register_oom_notifier() because we don't want to notify the 
> root memcg for a system oom handler if existing oom notifiers free memory 
> (powerpc or s390).  We also don't want to notify the root memcg when 
> current is exiting or has a pending SIGKILL, we just want to silently give 
> it access to memory reserves and exit.  The mem_cgroup_root_oom_notify() 
> here is placed correctly.

This is all handwaving.  Somebody called out_of_memory() after they
failed reclaim, the machine is OOM.  The fact that current is exiting
without requiring a kill is coincidental and irrelevant.  You want an
OOM notification, use the OOM notifiers, that's what they're for.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-11-13 23:34     ` Johannes Weiner
@ 2013-11-14  0:56       ` David Rientjes
  2013-11-14  3:25         ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-14  0:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 13 Nov 2013, Johannes Weiner wrote:

> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -155,6 +155,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
> > > >  }
> > > >  
> > > >  bool mem_cgroup_oom_synchronize(bool wait);
> > > > +void mem_cgroup_root_oom_notify(void);
> > > >  
> > > >  #ifdef CONFIG_MEMCG_SWAP
> > > >  extern int do_swap_account;
> > > > @@ -397,6 +398,10 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
> > > >  	return false;
> > > >  }
> > > >  
> > > > +static inline void mem_cgroup_root_oom_notify(void)
> > > > +{
> > > > +}
> > > > +
> > > >  static inline void mem_cgroup_inc_page_stat(struct page *page,
> > > >  					    enum mem_cgroup_stat_index idx)
> > > >  {
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -5641,6 +5641,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
> > > >  		mem_cgroup_oom_notify_cb(iter);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Notify any process waiting on the root memcg's memory.oom_control, but do not
> > > > + * notify any child memcgs to avoid triggering their per-memcg oom handlers.
> > > > + */
> > > > +void mem_cgroup_root_oom_notify(void)
> > > > +{
> > > > +	mem_cgroup_oom_notify_cb(root_mem_cgroup);
> > > > +}
> > > > +
> > > >  static int mem_cgroup_usage_register_event(struct cgroup_subsys_state *css,
> > > >  	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
> > > >  {
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -632,6 +632,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > > >  		return;
> > > >  	}
> > > >  
> > > > +	/* Avoid waking up processes for oom kills triggered by sysrq */
> > > > +	if (!force_kill)
> > > > +		mem_cgroup_root_oom_notify();
> > > 
> > > We have an API for global OOM notifications, please just use
> > > register_oom_notifier() instead.
> > > 
> > 
> > We can't use register_oom_notifier() because we don't want to notify the 
> > root memcg for a system oom handler if existing oom notifiers free memory 
> > (powerpc or s390).  We also don't want to notify the root memcg when 
> > current is exiting or has a pending SIGKILL, we just want to silently give 
> > it access to memory reserves and exit.  The mem_cgroup_root_oom_notify() 
> > here is placed correctly.
> 
> This is all handwaving.

I'm defining the semantics of the system oom notification for the root 
memcg.  Userspace oom handlers are not going to want to wake up when a 
kernel oom notifier is capable of freeing memory to prevent the oom killer 
from doing anything at all or if current simply needs access to memory 
reserves to make forward progress.  Userspace oom handlers want a wakeup 
when a process must be killed to free memory, and thus this is correctly 
placed.
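
The ordering being argued for can be modelled in userspace along these
lines.  This is only a simplified sketch of the control flow in
out_of_memory() with the patch applied, not kernel code: struct oom_state,
enum oom_outcome and model_out_of_memory() are made-up names, and the
victim selection that follows the notification is left out.

```c
#include <stdbool.h>

enum oom_outcome {
	OOM_NOTIFIER_FREED,	/* a kernel oom notifier freed memory */
	OOM_CURRENT_EXITING,	/* current got memory reserves, no kill */
	OOM_SYSRQ_KILL,		/* forced kill, userspace is not woken */
	OOM_KILL_NOTIFIED,	/* root memcg waiters woken, then a kill */
};

struct oom_state {
	unsigned long notifier_freed;	/* pages freed via oom_notify_list */
	bool current_exiting;		/* pending SIGKILL or PF_EXITING */
	bool force_kill;		/* oom kill triggered by sysrq */
	unsigned int root_wakeups;	/* models mem_cgroup_root_oom_notify() */
};

static enum oom_outcome model_out_of_memory(struct oom_state *s)
{
	/* Kernel oom notifiers (powerpc, s390) may avoid the kill entirely. */
	if (s->notifier_freed > 0)
		return OOM_NOTIFIER_FREED;

	/* current only needs reserves to exit; nothing is killed. */
	if (s->current_exiting)
		return OOM_CURRENT_EXITING;

	/* Avoid waking up processes for oom kills triggered by sysrq. */
	if (s->force_kill)
		return OOM_SYSRQ_KILL;

	/* Only now is a process about to be killed: wake root waiters. */
	s->root_wakeups++;
	return OOM_KILL_NOTIFIED;
}
```

In this model the wakeup counter only moves on the last path, which is the
"a process must be killed" semantic described above.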

> Somebody called out_of_memory() after they
> failed reclaim, the machine is OOM.

While momentarily oom, the oom notifiers in powerpc and s390 have the 
ability to free memory without requiring a kill.

> The fact that current is exiting
> without requiring a kill is coincidental and irrelevant.  You want an
> OOM notification, use the OOM notifiers, that's what they're for.
> 

I think you're misunderstanding the kernel oom notifiers, they exist 
solely to free memory so that the oom killer actually doesn't have to kill 
anything.  The fact that they use kernel notifiers is irrelevant and 
userspace oom notification is separate.  Userspace is only going to want a 
notification when the oom killer has to kill something, the EXACT same 
semantics as the non-root-memcg memory.oom_control.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-11-14  0:56       ` David Rientjes
@ 2013-11-14  3:25         ` Johannes Weiner
  2013-11-14 22:57           ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-11-14  3:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, Nov 13, 2013 at 04:56:09PM -0800, David Rientjes wrote:
> On Wed, 13 Nov 2013, Johannes Weiner wrote:
> 
> > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > --- a/include/linux/memcontrol.h
> > > > > +++ b/include/linux/memcontrol.h
> > > > > @@ -155,6 +155,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
> > > > >  }
> > > > >  
> > > > >  bool mem_cgroup_oom_synchronize(bool wait);
> > > > > +void mem_cgroup_root_oom_notify(void);
> > > > >  
> > > > >  #ifdef CONFIG_MEMCG_SWAP
> > > > >  extern int do_swap_account;
> > > > > @@ -397,6 +398,10 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
> > > > >  	return false;
> > > > >  }
> > > > >  
> > > > > +static inline void mem_cgroup_root_oom_notify(void)
> > > > > +{
> > > > > +}
> > > > > +
> > > > >  static inline void mem_cgroup_inc_page_stat(struct page *page,
> > > > >  					    enum mem_cgroup_stat_index idx)
> > > > >  {
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -5641,6 +5641,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
> > > > >  		mem_cgroup_oom_notify_cb(iter);
> > > > >  }
> > > > >  
> > > > > +/*
> > > > > + * Notify any process waiting on the root memcg's memory.oom_control, but do not
> > > > > + * notify any child memcgs to avoid triggering their per-memcg oom handlers.
> > > > > + */
> > > > > +void mem_cgroup_root_oom_notify(void)
> > > > > +{
> > > > > +	mem_cgroup_oom_notify_cb(root_mem_cgroup);
> > > > > +}
> > > > > +
> > > > >  static int mem_cgroup_usage_register_event(struct cgroup_subsys_state *css,
> > > > >  	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
> > > > >  {
> > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -632,6 +632,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > > > >  		return;
> > > > >  	}
> > > > >  
> > > > > +	/* Avoid waking up processes for oom kills triggered by sysrq */
> > > > > +	if (!force_kill)
> > > > > +		mem_cgroup_root_oom_notify();
> > > > 
> > > > We have an API for global OOM notifications, please just use
> > > > register_oom_notifier() instead.
> > > > 
> > > 
> > > We can't use register_oom_notifier() because we don't want to notify the 
> > > root memcg for a system oom handler if existing oom notifiers free memory 
> > > (powerpc or s390).  We also don't want to notify the root memcg when 
> > > current is exiting or has a pending SIGKILL, we just want to silently give 
> > > it access to memory reserves and exit.  The mem_cgroup_root_oom_notify() 
> > > here is placed correctly.
> > 
> > This is all handwaving.
> 
> I'm defining the semantics of the system oom notification for the root 
> memcg.  Userspace oom handlers are not going to want to wake up when a 
> kernel oom notifier is capable of freeing memory to prevent the oom killer 
> from doing anything at all or if current simply needs access to memory 
> reserves to make forward progress.  Userspace oom handlers want a wakeup 
> when a process must be killed to free memory, and thus this is correctly 
> placed.

Userspace may very much be interested in an OOM situation, REGARDLESS
of what action needs to be taken.  Userspace always has the ability to
filter out events and look at the stats after the notification, but it
cannot know about situations it's not told about.
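
As a rough illustration of the userspace side of that argument: an eventfd
read returns an 8-byte counter of events signalled since the last read,
after which the handler is free to inspect whatever stats it likes and
decide whether the event matters.  The helper name read_oom_events() below
is hypothetical, and the filtering policy is left to the caller.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Block until the eventfd has been signalled at least once, then store
 * the number of events accumulated since the previous read in *count.
 * Returns 0 on success, -1 on a failed or short read.
 */
static int read_oom_events(int efd, uint64_t *count)
{
	ssize_t n = read(efd, count, sizeof(*count));

	return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

A handler built on this would loop on read_oom_events() and apply its own
filter after each wakeup, for example by ignoring wakeups where its stats
show that nothing was killed.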

> > Somebody called out_of_memory() after they
> > failed reclaim, the machine is OOM.
> 
> While momentarily oom, the oom notifiers in powerpc and s390 have the 
> ability to free memory without requiring a kill.

So either

1) they should be part of the regular reclaim process, or

2) their invocation is severe enough to not be part of reclaim, at
   which point we should probably tell userspace about the OOM

> > The fact that current is exiting
> > without requiring a kill is coincidental and irrelevant.  You want an
> > OOM notification, use the OOM notifiers, that's what they're for.
> > 
> 
> I think you're misunderstanding the kernel oom notifiers, they exist 
> solely to free memory so that the oom killer actually doesn't have to kill 
> anything.  The fact that they use kernel notifiers is irrelevant and 
> userspace oom notification is separate.  Userspace is only going to want a 
> notification when the oom killer has to kill something, the EXACT same 
> semantics as the non-root-memcg memory.oom_control.

That's actually not true, we invoke the OOM notifier before calling
mem_cgroup_out_of_memory(), which then may skip the kill in favor of
letting current exit.  It does this when the kernel handler is
enabled, which would be the equivalent of what you are implementing.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-11-14  3:25         ` Johannes Weiner
@ 2013-11-14 22:57           ` David Rientjes
  2013-11-14 23:26             ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves David Rientjes
  2013-11-18 15:54             ` [patch] mm, memcg: add memory.oom_control notification for system oom Johannes Weiner
  0 siblings, 2 replies; 87+ messages in thread
From: David Rientjes @ 2013-11-14 22:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 13 Nov 2013, Johannes Weiner wrote:

> > > Somebody called out_of_memory() after they
> > > failed reclaim, the machine is OOM.
> > 
> > While momentarily oom, the oom notifiers in powerpc and s390 have the 
> > ability to free memory without requiring a kill.
> 
> So either
> 
> 1) they should be part of the regular reclaim process, or
> 
> 2) their invocation is severe enough to not be part of reclaim, at
>    which point we should probably tell userspace about the OOM
> 

(1) is already true, we can avoid oom by freeing memory for subsystems 
using register_oom_notifier(), so we're not actually oom.  It's a late 
callback into the kernel to free memory, in effect a form of reclaim.  It was 
added directly into out_of_memory() purely for simplicity; it could be 
moved to the page allocator if we move all of the oom_notify_list helpers 
there as well.

The same is true of silently setting TIF_MEMDIE for current so that it has 
access to memory reserves and may exit when it has a pending SIGKILL or is 
already exiting.

In both cases, we're not actually oom because either (a) the kernel can 
still free memory and avoid actually killing a process, or (b) current 
simply needs access to memory reserves so it may die.

We don't want to invoke the userspace oom handler when we first enter 
direct reclaim, for example, for the same reason.

> > I think you're misunderstanding the kernel oom notifiers, they exist 
> > solely to free memory so that the oom killer actually doesn't have to kill 
> > anything.  The fact that they use kernel notifiers is irrelevant and 
> > userspace oom notification is separate.  Userspace is only going to want a 
> > notification when the oom killer has to kill something, the EXACT same 
> > semantics as the non-root-memcg memory.oom_control.
> 
> That's actually not true, we invoke the OOM notifier before calling
> mem_cgroup_out_of_memory(), which then may skip the kill in favor of
> letting current exit.  It does this when the kernel handler is
> enabled, which would be the equivalent of what you are implementing.
> 

Good point, I don't think we should be notifying userspace for memcg oom 
conditions when current simply needs access to memory reserves to exit: 
the memcg isn't actually oom since TIF_MEMDIE implies memcg bypass.  I 
think we should do that in mem_cgroup_handle_oom() rather than 
mem_cgroup_out_of_memory().


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-14 22:57           ` David Rientjes
@ 2013-11-14 23:26             ` David Rientjes
  2013-11-14 23:26               ` [patch 2/2] mm, memcg: add memory.oom_control notification for system oom David Rientjes
                                 ` (2 more replies)
  2013-11-18 15:54             ` [patch] mm, memcg: add memory.oom_control notification for system oom Johannes Weiner
  1 sibling, 3 replies; 87+ messages in thread
From: David Rientjes @ 2013-11-14 23:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

When current has a pending SIGKILL or is already in the exit path, it
only needs access to memory reserves to fully exit.  In that sense, the
memcg is not actually oom for current; it simply needs to bypass memory
charges to exit and free its memory, which is itself a guarantee that
memory will be freed.

We only want to notify userspace for actionable oom conditions where
something needs to be done (and all oom handling can already be deferred
to userspace through this method by disabling the memcg oom killer with
memory.oom_control), not simply when a memcg has reached its limit, which
necessarily happens before memcg reclaim frees any memory for charges.

Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/memcontrol.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1783,16 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
-		return;
-	}
-
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
@@ -2243,6 +2233,16 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!handle)
 		goto cleanup;
 
+	/*
+	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * select it.  The goal is to allow it to allocate so that it may
+	 * quickly exit and free its memory.
+	 */
+	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+		set_thread_flag(TIF_MEMDIE);
+		goto cleanup;
+	}
+
 	owait.memcg = memcg;
 	owait.wait.flags = 0;
 	owait.wait.func = memcg_oom_wake_function;
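
The effect of moving the check can be modelled in userspace as below.
This is a deliberately simplified sketch, not the kernel code: struct
memcg_model and both functions are made-up names, and locking, waiting
and victim selection are all left out.  It only shows whether the memcg
oom notification fires when current is already exiting.

```c
#include <stdbool.h>

struct memcg_model {
	bool current_exiting;		/* pending SIGKILL or PF_EXITING */
	bool memdie;			/* stands in for TIF_MEMDIE */
	unsigned int notifications;	/* models mem_cgroup_oom_notify() */
};

/* Before the patch: waiters are notified, then the exit check runs. */
static void memcg_oom_before(struct memcg_model *m)
{
	m->notifications++;
	if (m->current_exiting) {	/* was in mem_cgroup_out_of_memory() */
		m->memdie = true;
		return;
	}
	/* ... select and kill a victim ... */
}

/* After the patch: the exit check runs first and skips the notification. */
static void memcg_oom_after(struct memcg_model *m)
{
	if (m->current_exiting) {	/* now in mem_cgroup_oom_synchronize() */
		m->memdie = true;
		return;
	}
	m->notifications++;
	/* ... select and kill a victim ... */
}
```

When current is not exiting, both orderings notify exactly once, so only
the exiting case changes behaviour in this model.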


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [patch 2/2] mm, memcg: add memory.oom_control notification for system oom
  2013-11-14 23:26             ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves David Rientjes
@ 2013-11-14 23:26               ` David Rientjes
  2013-11-18 18:52                 ` Michal Hocko
  2013-11-18 12:52               ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves Michal Hocko
  2013-11-18 15:41               ` Johannes Weiner
  2 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-14 23:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

A subset of applications that wait on memory.oom_control don't disable
the oom killer for that memcg and simply log or clean up after the kernel
oom killer kills a process to free memory.

We need the ability to do this for system oom conditions as well, i.e.
when the system is depleted of all memory and must kill a process.  For
convenience, this can use memcg since oom notifiers are already present.

When a userspace process waits on the root memcg's memory.oom_control, it
will wake up anytime there is a system oom condition so that it can log
the event, including what process was killed and the stack, or clean up
after the kernel oom killer has killed something.

This is a special case of oom notifiers since it doesn't subsequently
notify all memcgs under the root memcg (all memcgs on the system).  We
don't want to trigger those oom handlers which are set aside specifically
for true memcg oom notifications that disable their own oom killers to
enforce their own oom policy, for example.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroups/memory.txt | 11 ++++++-----
 include/linux/memcontrol.h       |  5 +++++
 mm/memcontrol.c                  |  9 +++++++++
 mm/oom_kill.c                    |  4 ++++
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -743,18 +743,19 @@ delivery and gets notification when OOM happens.
 
 To register a notifier, an application must:
  - create an eventfd using eventfd(2)
- - open memory.oom_control file
+ - open memory.oom_control file for reading
  - write string like "<event_fd> <fd of memory.oom_control>" to
    cgroup.event_control
 
-The application will be notified through eventfd when OOM happens.
-OOM notification doesn't work for the root cgroup.
+The application will be notified through eventfd when OOM happens, including
+on system oom when used with the root memcg.
 
 You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
 
-	#echo 1 > memory.oom_control
+	# echo 1 > memory.oom_control
 
-This operation is only allowed to the top cgroup of a sub-hierarchy.
+This operation is only allowed to the top cgroup of a sub-hierarchy and does
+not include the root memcg.
 If OOM-killer is disabled, tasks under cgroup will hang/sleep
 in memory cgroup's OOM-waitqueue when they request accountable memory.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -155,6 +155,7 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 }
 
 bool mem_cgroup_oom_synchronize(bool wait);
+void mem_cgroup_root_oom_notify(void);
 
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
@@ -397,6 +398,10 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
 	return false;
 }
 
+static inline void mem_cgroup_root_oom_notify(void)
+{
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_stat_index idx)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5648,6 +5648,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
 		mem_cgroup_oom_notify_cb(iter);
 }
 
+/*
+ * Notify any process waiting on the root memcg's memory.oom_control, but do not
+ * notify any child memcgs to avoid triggering their per-memcg oom handlers.
+ */
+void mem_cgroup_root_oom_notify(void)
+{
+	mem_cgroup_oom_notify_cb(root_mem_cgroup);
+}
+
 static int mem_cgroup_usage_register_event(struct cgroup_subsys_state *css,
 	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
 {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -632,6 +632,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;
 	}
 
+	/* Avoid waking up processes for oom kills triggered by sysrq */
+	if (!force_kill)
+		mem_cgroup_root_oom_notify();
+
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
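
The registration steps listed in the Documentation hunk above (create an
eventfd, open memory.oom_control for reading, write both descriptors to
cgroup.event_control) can be sketched from userspace roughly as follows.
This is a minimal illustration, not code from the patch: the
oom_listener_* names are hypothetical, and the mount point of the memory
controller (often /sys/fs/cgroup/memory) is an assumption that varies
between systems. With the patch applied, passing the root memcg directory
would be expected to wake the listener on system oom.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: the two descriptors a listener keeps open. */
struct oom_listener {
	int efd;	/* eventfd signalled when an oom event is delivered */
	int ofd;	/* memory.oom_control, opened for reading */
};

/*
 * Register for oom events on the memcg directory `dir`.
 * Returns 0 on success, -1 on error (both descriptors are then -1).
 */
int oom_listener_open(const char *dir, struct oom_listener *l)
{
	char path[4096], line[32];
	int cfd, len;

	l->efd = l->ofd = -1;

	len = snprintf(path, sizeof(path), "%s/memory.oom_control", dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;
	l->ofd = open(path, O_RDONLY);
	if (l->ofd < 0)
		return -1;

	l->efd = eventfd(0, 0);
	if (l->efd < 0)
		goto fail;

	len = snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		goto fail;
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	/* "<event_fd> <fd of memory.oom_control>" */
	len = snprintf(line, sizeof(line), "%d %d", l->efd, l->ofd);
	if (write(cfd, line, len) != len) {
		close(cfd);
		goto fail;
	}
	close(cfd);
	return 0;

fail:
	if (l->efd >= 0)
		close(l->efd);
	close(l->ofd);
	l->efd = l->ofd = -1;
	return -1;
}

/* Block until the eventfd is signalled; returns 0 on a wakeup. */
int oom_listener_wait(const struct oom_listener *l)
{
	uint64_t count;

	if (read(l->efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return 0;
}
```

A logging daemon would call oom_listener_open() once and then loop on
oom_listener_wait(). Both descriptors are kept open for the lifetime of the
listener, which is the conservative choice here.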


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-14 23:26             ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves David Rientjes
  2013-11-14 23:26               ` [patch 2/2] mm, memcg: add memory.oom_control notification for system oom David Rientjes
@ 2013-11-18 12:52               ` Michal Hocko
  2013-11-18 12:55                 ` Michal Hocko
  2013-11-18 15:41               ` Johannes Weiner
  2 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-11-18 12:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

[Adding Eric to CC]

On Thu 14-11-13 15:26:51, David Rientjes wrote:
> When current has a pending SIGKILL or is already in the exit path, it
> only needs access to memory reserves to fully exit.  In that sense, the
> memcg is not actually oom for current, it simply needs to bypass memory
> charges to exit and free its memory, which is guarantee itself that
> memory will be freed.
> 
> We only want to notify userspace for actionable oom conditions where
> something needs to be done (and all oom handling can already be deferred
> to userspace through this method by disabling the memcg oom killer with
> memory.oom_control), not simply when a memcg has reached its limit, which
> would actually have to happen before memcg reclaim actually frees memory
> for charges.

I believe this also fixes the issue reported by Eric
(https://lkml.org/lkml/2013/7/28/74). I had a patch for this
https://lkml.org/lkml/2013/7/31/94 but the code changed since then and
this should be equivalent.
 
> Reported-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/memcontrol.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1783,16 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned int points = 0;
>  	struct task_struct *chosen = NULL;
>  
> -	/*
> -	 * If current has a pending SIGKILL or is exiting, then automatically
> -	 * select it.  The goal is to allow it to allocate so that it may
> -	 * quickly exit and free its memory.
> -	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> -		set_thread_flag(TIF_MEMDIE);
> -		return;
> -	}
> -
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
>  	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
>  	for_each_mem_cgroup_tree(iter, memcg) {
> @@ -2243,6 +2233,16 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!handle)
>  		goto cleanup;
>  
> +	/*
> +	 * If current has a pending SIGKILL or is exiting, then automatically
> +	 * select it.  The goal is to allow it to allocate so that it may
> +	 * quickly exit and free its memory.
> +	 */
> +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +		set_thread_flag(TIF_MEMDIE);
> +		goto cleanup;
> +	}
> +
>  	owait.memcg = memcg;
>  	owait.wait.flags = 0;
>  	owait.wait.func = memcg_oom_wake_function;

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-18 12:52               ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves Michal Hocko
@ 2013-11-18 12:55                 ` Michal Hocko
  2013-11-19  1:19                   ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-11-18 12:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Mon 18-11-13 13:52:40, Michal Hocko wrote:
> [Adding Eric to CC]
> 
> On Thu 14-11-13 15:26:51, David Rientjes wrote:
> > When current has a pending SIGKILL or is already in the exit path, it
> > only needs access to memory reserves to fully exit.  In that sense, the
> > memcg is not actually oom for current, it simply needs to bypass memory
> > charges to exit and free its memory, which is guarantee itself that
> > memory will be freed.
> > 
> > We only want to notify userspace for actionable oom conditions where
> > something needs to be done (and all oom handling can already be deferred
> > to userspace through this method by disabling the memcg oom killer with
> > memory.oom_control), not simply when a memcg has reached its limit, which
> > would actually have to happen before memcg reclaim actually frees memory
> > for charges.
> 
> I believe this also fixes the issue reported by Eric
> (https://lkml.org/lkml/2013/7/28/74). I had a patch for this
> https://lkml.org/lkml/2013/7/31/94 but the code changed since then and
> this should be equivalent.
>  
> > Reported-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: David Rientjes <rientjes@google.com>

Anyway, the patch looks good to me but please mention the above bug in
the changelog.

Acked-by: Michal Hocko <mhocko@suse.cz>

> > ---
> >  mm/memcontrol.c | 20 ++++++++++----------
> >  1 file changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1783,16 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	unsigned int points = 0;
> >  	struct task_struct *chosen = NULL;
> >  
> > -	/*
> > -	 * If current has a pending SIGKILL or is exiting, then automatically
> > -	 * select it.  The goal is to allow it to allocate so that it may
> > -	 * quickly exit and free its memory.
> > -	 */
> > -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > -		set_thread_flag(TIF_MEMDIE);
> > -		return;
> > -	}
> > -
> >  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
> >  	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> >  	for_each_mem_cgroup_tree(iter, memcg) {
> > @@ -2243,6 +2233,16 @@ bool mem_cgroup_oom_synchronize(bool handle)
> >  	if (!handle)
> >  		goto cleanup;
> >  
> > +	/*
> > +	 * If current has a pending SIGKILL or is exiting, then automatically
> > +	 * select it.  The goal is to allow it to allocate so that it may
> > +	 * quickly exit and free its memory.
> > +	 */
> > +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > +		set_thread_flag(TIF_MEMDIE);
> > +		goto cleanup;
> > +	}
> > +
> >  	owait.memcg = memcg;
> >  	owait.wait.flags = 0;
> >  	owait.wait.func = memcg_oom_wake_function;
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-14 23:26             ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves David Rientjes
  2013-11-14 23:26               ` [patch 2/2] mm, memcg: add memory.oom_control notification for system oom David Rientjes
  2013-11-18 12:52               ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves Michal Hocko
@ 2013-11-18 15:41               ` Johannes Weiner
  2013-11-18 16:51                 ` Michal Hocko
  2 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-11-18 15:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu, Nov 14, 2013 at 03:26:51PM -0800, David Rientjes wrote:
> When current has a pending SIGKILL or is already in the exit path, it
> only needs access to memory reserves to fully exit.  In that sense, the
> memcg is not actually oom for current, it simply needs to bypass memory
> charges to exit and free its memory, which is guarantee itself that
> memory will be freed.
> 
> We only want to notify userspace for actionable oom conditions where
> something needs to be done (and all oom handling can already be deferred
> to userspace through this method by disabling the memcg oom killer with
> memory.oom_control), not simply when a memcg has reached its limit, which
> would actually have to happen before memcg reclaim actually frees memory
> for charges.

Even though the situation may not require a kill, the user still wants
to know that the memory hard limit was breached and the isolation
broken in order to prevent a kill.  We just came really close and the
fact that current is exiting is coincidental.  Not everybody is having
OOM situations on a frequent basis and they might want to know when
they are redlining the system and that the same workload might blow up
the next time it's run.

The emergency reserves are there to prevent the system from
deadlocking.  We only dip into them to avert a more imminent disaster
but we are no longer in good shape at this point.  But by not even
announcing this situation to userspace anymore you are making this the
new baseline and declaring that everything is fine when the system is
already clutching at straws.

I maintain that we should signal OOM when our healthy and
always-available options are exhausted.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-11-14 22:57           ` David Rientjes
  2013-11-14 23:26             ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves David Rientjes
@ 2013-11-18 15:54             ` Johannes Weiner
  2013-11-18 23:15               ` One Thousand Gnomes
  1 sibling, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-11-18 15:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu, Nov 14, 2013 at 02:57:51PM -0800, David Rientjes wrote:
> On Wed, 13 Nov 2013, Johannes Weiner wrote:
> 
> > > > Somebody called out_of_memory() after they
> > > > failed reclaim, the machine is OOM.
> > > 
> > > While momentarily oom, the oom notifiers in powerpc and s390 have the 
> > > ability to free memory without requiring a kill.
> > 
> > So either
> > 
> > 1) they should be part of the regular reclaim process, or
> > 
> > 2) their invocation is severe enough to not be part of reclaim, at
> >    which point we should probably tell userspace about the OOM
> > 
> 
> (1) is already true, we can avoid oom by freeing memory for subsystems 
> using register_oom_notifier(), so we're not actually oom.  It's a late 
> callback into the kernel to free memory in a sense of reclaim.  It was 
> added directly into out_of_memory() purely for simplicity; it could be 
> moved to the page allocator if we move all of the oom_notify_list helpers 
> there as well.

If they can easily free it without any repercussions, they should
really be part of regular reclaim.  Maybe convert them to shrinkers.

And then you can use OOM notifiers to be notified about OOM.

> The same is true of silently setting TIF_MEMDIE for current so that it has 
> access to memory reserves and may exit when it has a pending SIGKILL or is 
> already exiting.
> 
> In both cases, we're not actually oom because either (a) the kernel can 
> still free memory and avoid actually killing a process, or (b) current 
> simply needs access to memory reserves so it may die.
> 
> We don't want to invoke the userspace oom handler when we first enter 
> direct reclaim, for example, for the same reason.

Reclaim is an option the kernel always has, the current task exiting
is a coincidence.

And accessing the emergency reserves means we are definitely no longer
A-OK, this is not comparable to the first direct reclaim invocation.

We exhausted our options and we got really lucky.  It should not be
considered the baseline and a user listening for "OOM conditions"
should be informed about this.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-18 15:41               ` Johannes Weiner
@ 2013-11-18 16:51                 ` Michal Hocko
  2013-11-19  1:22                   ` David Rientjes
  2013-11-22 16:51                   ` Johannes Weiner
  0 siblings, 2 replies; 87+ messages in thread
From: Michal Hocko @ 2013-11-18 16:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon 18-11-13 10:41:15, Johannes Weiner wrote:
> On Thu, Nov 14, 2013 at 03:26:51PM -0800, David Rientjes wrote:
> > When current has a pending SIGKILL or is already in the exit path, it
> > only needs access to memory reserves to fully exit.  In that sense, the
> > memcg is not actually oom for current, it simply needs to bypass memory
> > charges to exit and free its memory, which is guarantee itself that
> > memory will be freed.
> > 
> > We only want to notify userspace for actionable oom conditions where
> > something needs to be done (and all oom handling can already be deferred
> > to userspace through this method by disabling the memcg oom killer with
> > memory.oom_control), not simply when a memcg has reached its limit, which
> > would actually have to happen before memcg reclaim actually frees memory
> > for charges.
> 
> Even though the situation may not require a kill, the user still wants
> to know that the memory hard limit was breached and the isolation
> broken in order to prevent a kill.  We just came really close and the

You can already observe that you are getting into trouble from the fail
counter. Its usability without more reclaim statistics is a bit
questionable, but you at least get a rough impression that something is
wrong.
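
The fail counter referred to here is presumably the cgroup-v1
memory.failcnt file, which counts how often usage hit the limit; that
mapping is an assumption, as the thread does not name the file. A minimal
sketch of sampling such a single-value counter file, with a hypothetical
helper name:

```c
#include <stdio.h>

/*
 * Read one unsigned decimal counter (e.g. memory.failcnt) from `path`.
 * Returns 0 and stores the value in *out on success, -1 on any error.
 */
int read_counter(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Polling this value and comparing successive samples shows that the limit
is being hit, but, as noted above, it says little about how hard reclaim
had to work.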

> fact that current is exiting is coincidental.  Not everybody is having
> OOM situations on a frequent basis and they might want to know when
> they are redlining the system and that the same workload might blow up
> the next time it's run.

I am just concerned that signaling temporary OOM conditions which do not
require any OOM killer action (user or kernel space) might be confusing.
Userspace would have a harder time telling whether any action is
required or not.

> The emergency reserves are there to prevent the system from
> deadlocking.  We only dip into them to avert a more imminent disaster
> but we are no longer in good shape at this point.  But by not even
> announcing this situation to userspace anymore you are making this the
> new baseline and declaring that everything is fine when the system is
> already clutching at straws.
> 
> I maintain that we should signal OOM when our healthy and
> always-available options are exhausted.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 2/2] mm, memcg: add memory.oom_control notification for system oom
  2013-11-14 23:26               ` [patch 2/2] mm, memcg: add memory.oom_control notification for system oom David Rientjes
@ 2013-11-18 18:52                 ` Michal Hocko
  2013-11-19  1:25                   ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-11-18 18:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu 14-11-13 15:26:55, David Rientjes wrote:
> A subset of applications that wait on memory.oom_control don't disable
> the oom killer for that memcg and simply log or cleanup after the kernel
> oom killer kills a process to free memory.
> 
> We need the ability to do this for system oom conditions as well, i.e.
> when the system is depleted of all memory and must kill a process.  For
> convenience, this can use memcg since oom notifiers are already present.

Using the memcg interface as a "read-only" interface without any plan for
the "write" side is only a halfway solution. We want to handle global OOM
in more user-defined ways, but we have to agree on the proper interface
first. I do not want to end up with something half-baked in memcg and
a different interface to do the real thing just because memcg turns out
to be unsuitable.

And to be honest, the more I think about a memcg-based interface, the
stronger my feeling is that it is unsuitable for user-defined OOM
policies. But that should be discussed properly (I will send an RFD in
the following days).

[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch] mm, memcg: add memory.oom_control notification for system oom
  2013-11-18 15:54             ` [patch] mm, memcg: add memory.oom_control notification for system oom Johannes Weiner
@ 2013-11-18 23:15               ` One Thousand Gnomes
  0 siblings, 0 replies; 87+ messages in thread
From: One Thousand Gnomes @ 2013-11-18 23:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, cgroups

> And accessing the emergency reserves means we are definitely no longer
> A-OK, this is not comparable to the first direct reclaim invocation.
> 
> We exhausted our options and we got really lucky.  It should not be
> considered the baseline and a user listening for "OOM conditions"
> should be informed about this.

Definitely concur - there are load tuning cases where you want to
drive the box to the point it starts whining and then scale back a touch.

It's an API change in effect, and while I can believe there are good
arguments for both, any API change ought to be a new API for listening
only to serious OOM cases.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-18 12:55                 ` Michal Hocko
@ 2013-11-19  1:19                   ` David Rientjes
  0 siblings, 0 replies; 87+ messages in thread
From: David Rientjes @ 2013-11-19  1:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Mon, 18 Nov 2013, Michal Hocko wrote:

> > > When current has a pending SIGKILL or is already in the exit path, it
> > > only needs access to memory reserves to fully exit.  In that sense, the
> > > memcg is not actually oom for current, it simply needs to bypass memory
> > > charges to exit and free its memory, which is guarantee itself that
> > > memory will be freed.
> > > 
> > > We only want to notify userspace for actionable oom conditions where
> > > something needs to be done (and all oom handling can already be deferred
> > > to userspace through this method by disabling the memcg oom killer with
> > > memory.oom_control), not simply when a memcg has reached its limit, which
> > > would actually have to happen before memcg reclaim actually frees memory
> > > for charges.
> > 
> > I believe this also fixes the issue reported by Eric
> > (https://lkml.org/lkml/2013/7/28/74). I had a patch for this
> > https://lkml.org/lkml/2013/7/31/94 but the code changed since then and
> > this should be equivalent.
> >  
> > > Reported-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> Anyway, the patch looks good to me but please mention the above bug in
> the changelog.
> 

The patch is in -mm, so perhaps we can change the changelog if/when Eric 
confirms it fixes his issue.

> Acked-by: Michal Hocko <mhocko@suse.cz>
> 

Thanks!


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-18 16:51                 ` Michal Hocko
@ 2013-11-19  1:22                   ` David Rientjes
  2013-11-22 16:51                   ` Johannes Weiner
  1 sibling, 0 replies; 87+ messages in thread
From: David Rientjes @ 2013-11-19  1:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon, 18 Nov 2013, Michal Hocko wrote:

> > Even though the situation may not require a kill, the user still wants
> > to know that the memory hard limit was breached and the isolation
> > broken in order to prevent a kill.  We just came really close and the
> 
> You can observe that you are getting into troubles from fail counter
> already. The usability without more reclaim statistics is a bit
> questionable but you get a rough impression that something is wrong at
> least.
> 

Agreed, but it seems like the appropriate mechanism for this is through 
the memory.{,memsw.}usage_in_bytes notifiers which already exist.

> > fact that current is exiting is coincidental.  Not everybody is having
> > OOM situations on a frequent basis and they might want to know when
> > they are redlining the system and that the same workload might blow up
> > the next time it's run.
> 
> I am just concerned that signaling temporal OOM conditions which do not
> require any OOM killer action (user or kernel space) might be confusing.
> Userspace would have harder times to tell whether any action is required
> or not.
> 

Completely agreed, in fact there is no reliable and non-racy way in 
userspace to determine "is this a real oom condition that I must act upon 
or can the kernel handle it?"


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 2/2] mm, memcg: add memory.oom_control notification for system oom
  2013-11-18 18:52                 ` Michal Hocko
@ 2013-11-19  1:25                   ` David Rientjes
  2013-11-19 12:41                     ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-19  1:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon, 18 Nov 2013, Michal Hocko wrote:

> > A subset of applications that wait on memory.oom_control don't disable
> > the oom killer for that memcg and simply log or cleanup after the kernel
> > oom killer kills a process to free memory.
> > 
> > We need the ability to do this for system oom conditions as well, i.e.
> > when the system is depleted of all memory and must kill a process.  For
> > convenience, this can use memcg since oom notifiers are already present.
> 
> Using the memcg interface for "read-only" interface without any plan for
> the "write" is only halfway solution. We want to handle global OOM in a
> more user defined ways but we have to agree on the proper interface
> first. I do not want to end up with something half baked with memcg and
> a different interface to do the real thing just because memcg turns out
> to be unsuitable.
> 

This patch isn't really a halfway solution: you can still check whether
open(O_WRONLY) succeeds to determine whether that feature has been
implemented.  I'm concerned about disabling the oom killer entirely for
system oom conditions, though, so I didn't implement it to be writable.  I
don't think we should be doing anything special in terms of "write"
behavior for the root memcg memory.oom_control, so I'd argue against doing
anything other than disabling the oom killer.  That's scary.
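
The feature probe described above could look roughly like the sketch
below. The helper name is hypothetical, and whether a given kernel rejects
the attempt at open() time or only on a later write() is not established
in this thread, so a successful open should be read as a hint rather than
proof of write support.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if `path` can be opened write-only, 0 otherwise.
 * On failure errno is left as set by open(2) so the caller can
 * distinguish "not permitted" from "file does not exist".
 */
int can_open_for_write(const char *path)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return 0;
	close(fd);
	return 1;
}
```

A caller would pass the root memcg's memory.oom_control path, whose
location depends on where the memory controller is mounted.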


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 2/2] mm, memcg: add memory.oom_control notification for system oom
  2013-11-19  1:25                   ` David Rientjes
@ 2013-11-19 12:41                     ` Michal Hocko
  0 siblings, 0 replies; 87+ messages in thread
From: Michal Hocko @ 2013-11-19 12:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon 18-11-13 17:25:13, David Rientjes wrote:
> On Mon, 18 Nov 2013, Michal Hocko wrote:
> 
> > > A subset of applications that wait on memory.oom_control don't disable
> > > the oom killer for that memcg and simply log or cleanup after the kernel
> > > oom killer kills a process to free memory.
> > > 
> > > We need the ability to do this for system oom conditions as well, i.e.
> > > when the system is depleted of all memory and must kill a process.  For
> > > convenience, this can use memcg since oom notifiers are already present.
> > 
> > Using the memcg interface for "read-only" interface without any plan for
> > the "write" is only halfway solution. We want to handle global OOM in a
> > more user defined ways but we have to agree on the proper interface
> > first. I do not want to end up with something half baked with memcg and
> > a different interface to do the real thing just because memcg turns out
> > to be unsuitable.
> > 
> 
> This patch isn't really a halfway solution, you can still determine if the 
> open(O_WRONLY) succeeds or not to determine if that feature has been 
> implemented. 

Let's say that we end up using loadable modules for the user policy
driven OOM killer. And that one would implement its own way of
notification or even no notification at all. How would an unrelated
check for open on a memcg file help?

> I'm concerned about disabling the oom killer entirely for 
> system oom conditions, though, so I didn't implement it to be writable.

I really do not like to use different interfaces to accomplish the two
parts of the process. OOM action and notification should be implemented
by the same "subsystem" (be it memcg, modules, foobar...).

> I don't think we should be doing anything special in terms of "write"
> behavior for the root memcg memory.oom_control, so I'd argue against
> doing anything other than disabling the oom killer.  That's scary.

But we need to have a way to describe user/admin policy for the global
OOM. Killing a task is just one of the policies and there are usecases (as
discussed at LSF2013) where e.g. killing the whole group of processes
makes much more sense. And there are many other possible policies. What
the proper interface is remains an open question and we should discuss
that properly. Memcg interface is one of the possible ways. We can also go
with kernel modules or a more generic filter-like interface with
userspace defined rules.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-18 16:51                 ` Michal Hocko
  2013-11-19  1:22                   ` David Rientjes
@ 2013-11-22 16:51                   ` Johannes Weiner
  2013-11-27  0:53                     ` David Rientjes
  1 sibling, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-11-22 16:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon, Nov 18, 2013 at 05:51:10PM +0100, Michal Hocko wrote:
> On Mon 18-11-13 10:41:15, Johannes Weiner wrote:
> > On Thu, Nov 14, 2013 at 03:26:51PM -0800, David Rientjes wrote:
> > > When current has a pending SIGKILL or is already in the exit path, it
> > > only needs access to memory reserves to fully exit.  In that sense, the
> > > memcg is not actually oom for current, it simply needs to bypass memory
> > > charges to exit and free its memory, which is guarantee itself that
> > > memory will be freed.
> > > 
> > > We only want to notify userspace for actionable oom conditions where
> > > something needs to be done (and all oom handling can already be deferred
> > > to userspace through this method by disabling the memcg oom killer with
> > > memory.oom_control), not simply when a memcg has reached its limit, which
> > > would actually have to happen before memcg reclaim actually frees memory
> > > for charges.
> > 
> > Even though the situation may not require a kill, the user still wants
> > to know that the memory hard limit was breached and the isolation
> > broken in order to prevent a kill.  We just came really close and the
> 
> You can observe that you are getting into troubles from fail counter
> already. The usability without more reclaim statistics is a bit
> questionable but you get a rough impression that something is wrong at
> least.
> 
> > fact that current is exiting is coincidental.  Not everybody is having
> > OOM situations on a frequent basis and they might want to know when
> > they are redlining the system and that the same workload might blow up
> > the next time it's run.
> 
> I am just concerned that signaling temporal OOM conditions which do not
> require any OOM killer action (user or kernel space) might be confusing.
> Userspace would have harder times to tell whether any action is required
> or not.

But userspace in all likeliness DOES need to take action.

Reclaim is a really long process.  If 5 times doing 12 priority cycles
and scanning thousands of pages is not enough to reclaim a single
page, what does that say about the health of the memcg?

But more importantly, OOM handling is just inherently racy.  A task
might receive the kill signal a split second *after* userspace was
notified.  Or a task may exit voluntarily a split second after a
victim was chosen and killed.

We have to draw a line somewhere, right now this is "reclaim failed".
This patch doesn't fix a problem, it just blurs that line and makes
OOM notifications less predictable.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-22 16:51                   ` Johannes Weiner
@ 2013-11-27  0:53                     ` David Rientjes
  2013-11-27 16:34                       ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-27  0:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Fri, 22 Nov 2013, Johannes Weiner wrote:

> But userspace in all likeliness DOES need to take action.
> 
> Reclaim is a really long process.  If 5 times doing 12 priority cycles
> and scanning thousands of pages is not enough to reclaim a single
> page, what does that say about the health of the memcg?
> 
> But more importantly, OOM handling is just inherently racy.  A task
> might receive the kill signal a split second *after* userspace was
> notified.  Or a task may exit voluntarily a split second after a
> victim was chosen and killed.
> 

That's not true even today without the userspace oom handling proposal 
currently being discussed if you have a memcg oom handler attached to a 
parent memcg with access to more memory than an oom child memcg.  The oom 
handler can disable the child memcg's oom killer with memory.oom_control 
and implement its own policy to deal with any notification of oom.
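
As a rough sketch of what such a handler would look at, the following
parses the text read back from memory.oom_control.  It assumes the cgroup
v1 layout of two lines, "oom_kill_disable <0|1>" and "under_oom <0|1>";
the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse the contents of memory.oom_control,
 * assumed to look like "oom_kill_disable 1\nunder_oom 0\n".
 * Returns 0 and fills both outputs on success, -1 on malformed input
 * (the outputs are left untouched in that case).
 */
int parse_oom_control(const char *text, int *kill_disabled, int *under_oom)
{
	int disabled, under;

	assert(text != NULL && kill_disabled != NULL && under_oom != NULL);

	/* Whitespace in the format also matches the newline between fields. */
	if (sscanf(text, "oom_kill_disable %d under_oom %d",
		   &disabled, &under) != 2)
		return -1;

	*kill_disabled = disabled;
	*under_oom = under;
	return 0;
}
```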

This patch is required to ensure that, in such a scenario, the oom 
handler sitting in the parent memcg only wakes up when it's required to 
intervene.  Making an inference about the "health of the memcg" can 
certainly be done with memory thresholds and vmpressure, if you need that.

I agree with Michal.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-27  0:53                     ` David Rientjes
@ 2013-11-27 16:34                       ` Johannes Weiner
  2013-11-27 21:51                         ` David Rientjes
  2013-12-02 20:02                         ` Michal Hocko
  0 siblings, 2 replies; 87+ messages in thread
From: Johannes Weiner @ 2013-11-27 16:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue, Nov 26, 2013 at 04:53:47PM -0800, David Rientjes wrote:
> On Fri, 22 Nov 2013, Johannes Weiner wrote:
> 
> > But userspace in all likeliness DOES need to take action.
> > 
> > Reclaim is a really long process.  If 5 times doing 12 priority cycles
> > and scanning thousands of pages is not enough to reclaim a single
> > page, what does that say about the health of the memcg?
> > 
> > But more importantly, OOM handling is just inherently racy.  A task
> > might receive the kill signal a split second *after* userspace was
> > notified.  Or a task may exit voluntarily a split second after a
> > victim was chosen and killed.
> > 
> 
> That's not true even today without the userspace oom handling proposal 
> currently being discussed if you have a memcg oom handler attached to a 
> parent memcg with access to more memory than an oom child memcg.  The oom 
> handler can disable the child memcg's oom killer with memory.oom_control 
> and implement its own policy to deal with any notification of oom.

I was never implying the kernel handler.  All the races exist with
userspace handling as well.

> This patch is required to ensure that in such a scenario that the oom 
> handler sitting in the parent memcg only wakes up when it's required to 
> intervene.

A task could receive an unrelated kill between the OOM notification
and going to sleep to wait for userspace OOM handling.  Or another
task could exit voluntarily between the notification and waitqueue
entry, which would again be short-cut by the oom_recover of the exit
uncharges.

oom:                           other tasks:
check signal/exiting
                               could exit or get killed here
mem_cgroup_oom_trylock()
                               could exit or get killed here
mem_cgroup_oom_notify()
                               could exit or get killed here
if (userspace_handler)
  sleep()                      could exit or get killed here
else
  oom_kill()
                               could exit or get killed here

It does not matter where your signal/exiting check is, OOM
notification can never be race free because OOM is just an arbitrary
line we draw.  We have no idea what all the tasks are up to and how
close they are to releasing memory.  Even if we freeze the whole group
to handle tasks, it does not change the fact that the userspace OOM
handler might kill one task and after the unfreeze another task
immediately exits voluntarily or got a kill signal a split second
after it was frozen.

You can't fix this.  We just have to draw the line somewhere and
accept that in rare situations the OOM kill was unnecessary.  So
again, I don't see this patch doing anything but blurring the current
line and making notification less predictable.  And, as someone else in
this thread already said, it's a user-visible change in behavior and
would break known tuning usecases.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-27 16:34                       ` Johannes Weiner
@ 2013-11-27 21:51                         ` David Rientjes
  2013-11-27 23:19                           ` Johannes Weiner
  2013-12-02 20:02                         ` Michal Hocko
  1 sibling, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-27 21:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > > But more importantly, OOM handling is just inherently racy.  A task
> > > might receive the kill signal a split second *after* userspace was
> > > notified.  Or a task may exit voluntarily a split second after a
> > > victim was chosen and killed.
> > > 
> > 
> > That's not true even today without the userspace oom handling proposal 
> > currently being discussed if you have a memcg oom handler attached to a 
> > parent memcg with access to more memory than an oom child memcg.  The oom 
> > handler can disable the child memcg's oom killer with memory.oom_control 
> > and implement its own policy to deal with any notification of oom.
> 
> I was never implying the kernel handler.  All the races exist with
> userspace handling as well.
> 

A process may indeed exit immediately after a different process was oom 
killed.  A process may also free memory immediately after a process was 
oom killed.

> > This patch is required to ensure that in such a scenario that the oom 
> > handler sitting in the parent memcg only wakes up when it's required to 
> > intervene.
> 
> A task could receive an unrelated kill between the OOM notification
> and going to sleep to wait for userspace OOM handling.  Or another
> task could exit voluntarily between the notification and waitqueue
> entry, which would again be short-cut by the oom_recover of the exit
> uncharges.
> 
> oom:                           other tasks:
> check signal/exiting
>                                could exit or get killed here
> mem_cgroup_oom_trylock()
>                                could exit or get killed here
> mem_cgroup_oom_notify()
>                                could exit or get killed here
> if (userspace_handler)
>   sleep()                      could exit or get killed here
> else
>   oom_kill()
>                                could exit or get killed here
> 
> It does not matter where your signal/exiting check is, OOM
> notification can never be race free because OOM is just an arbitrary
> line we draw.  We have no idea what all the tasks are up to and how
> close they are to releasing memory.  Even if we freeze the whole group
> to handle tasks, it does not change the fact that the userspace OOM
> handler might kill one task and after the unfreeze another task
> immediately exits voluntarily or got a kill signal a split second
> after it was frozen.
> 
> You can't fix this.  We just have to draw the line somewhere and
> accept that in rare situations the OOM kill was unnecessary.  So
> again, I don't see this patch is doing anything but blur the current
> line and make notification less predictable.  And, as someone else in
> this thread already said, it's a uservisible change in behavior and
> would break known tuning usecases.
> 

The patch is drawing the line at "the kernel can no longer do anything to 
free memory", and that's the line where userspace should be notified or a 
process killed by the kernel.  Giving current access to memory reserves in 
the oom killer is an optimization so that all reclaim is exhausted prior 
to declaring that they are necessary, the kernel still has the ability to 
allow that process to exit and free memory.  This is the same as the oom 
notifiers within the kernel that free memory from s390 and powerpc archs: 
the kernel still has the ability to free memory.  If you wish to be 
notified that you've simply reached the memcg limit, for whatever reason, 
you can monitor memory.failcnt or register a memory threshold.
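
A minimal sketch of the memory.failcnt approach: sample the file twice and
compare.  It assumes the file holds a single decimal counter; both helper
names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one sample of memory.failcnt, assumed to
 * be a single decimal number optionally followed by a newline.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_failcnt(const char *text, unsigned long long *out)
{
	char *end;

	assert(text != NULL && out != NULL);

	errno = 0;
	*out = strtoull(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	return 0;
}

/*
 * Number of new limit hits between two samples.  Returns 0 if either
 * sample is malformed or the counter went backwards (for instance
 * after it was reset).
 */
unsigned long long failcnt_delta(const char *before, const char *after)
{
	unsigned long long a, b;

	if (parse_failcnt(before, &a) != 0 || parse_failcnt(after, &b) != 0)
		return 0;
	return b > a ? b - a : 0;
}
```

Polling like this only says that the limit was hit, not that reclaim
failed, which is the distinction being argued in this thread.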

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-27 21:51                         ` David Rientjes
@ 2013-11-27 23:19                           ` Johannes Weiner
  2013-11-28  0:22                             ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-11-27 23:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, Nov 27, 2013 at 01:51:20PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > > But more importantly, OOM handling is just inherently racy.  A task
> > > > might receive the kill signal a split second *after* userspace was
> > > > notified.  Or a task may exit voluntarily a split second after a
> > > > victim was chosen and killed.
> > > > 
> > > 
> > > That's not true even today without the userspace oom handling proposal 
> > > currently being discussed if you have a memcg oom handler attached to a 
> > > parent memcg with access to more memory than an oom child memcg.  The oom 
> > > handler can disable the child memcg's oom killer with memory.oom_control 
> > > and implement its own policy to deal with any notification of oom.
> > 
> > I was never implying the kernel handler.  All the races exist with
> > userspace handling as well.
> > 
> 
> A process may indeed exit immediately after a different process was oom 
> killed.  A process may also free memory immediately after a process was 
> oom killed.
> 
> > > This patch is required to ensure that in such a scenario that the oom 
> > > handler sitting in the parent memcg only wakes up when it's required to 
> > > intervene.
> > 
> > A task could receive an unrelated kill between the OOM notification
> > and going to sleep to wait for userspace OOM handling.  Or another
> > task could exit voluntarily between the notification and waitqueue
> > entry, which would again be short-cut by the oom_recover of the exit
> > uncharges.
> > 
> > oom:                           other tasks:
> > check signal/exiting
> >                                could exit or get killed here
> > mem_cgroup_oom_trylock()
> >                                could exit or get killed here
> > mem_cgroup_oom_notify()
> >                                could exit or get killed here
> > if (userspace_handler)
> >   sleep()                      could exit or get killed here
> > else
> >   oom_kill()
> >                                could exit or get killed here
> > 
> > It does not matter where your signal/exiting check is, OOM
> > notification can never be race free because OOM is just an arbitrary
> > line we draw.  We have no idea what all the tasks are up to and how
> > close they are to releasing memory.  Even if we freeze the whole group
> > to handle tasks, it does not change the fact that the userspace OOM
> > handler might kill one task and after the unfreeze another task
> > immediately exits voluntarily or got a kill signal a split second
> > after it was frozen.
> > 
> > You can't fix this.  We just have to draw the line somewhere and
> > accept that in rare situations the OOM kill was unnecessary.  So
> > again, I don't see this patch is doing anything but blur the current
> > line and make notification less predictable.  And, as someone else in
> > this thread already said, it's a uservisible change in behavior and
> > would break known tuning usecases.
> > 
> 
> The patch is drawing the line at "the kernel can no longer do anything to 
> free memory", and that's the line where userspace should be notified or a 
> process killed by the kernel.
>
> Giving current access to memory reserves in the oom killer is an
> optimization so that all reclaim is exhausted prior to declaring
> that they are necessary, the kernel still has the ability to allow
> that process to exit and free memory.

"they" are necessary?

> This is the same as the oom notifiers within the kernel that free
> memory from s390 and powerpc archs: the kernel still has the ability
> to free memory.

They're not the same at all.  One is the kernel freeing memory, the
other is a random coincidence.

It's such an unlikely condition that you are not really helping the
notification to be less racy wrt concurrent memory freeing, which I
tried to explain still exists big time.  But it's enough to screw up
somebody's tuning effort by not reporting OOM, even though 60 reclaim
cycles have not produced a single page, just because the last
allocation happened to be in a dying task in that run.

> If you wish to be notified that you've simply reached the memcg
> limit, for whatever reason, you can monitor memory.failcnt or
> register a memory threshold.

Given a machine and a workload, I would like the OOM threshold to be
as predictable and reproducible as possible.  We can count on reclaim,
we can't count on the final straw coming from a dying task.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-27 23:19                           ` Johannes Weiner
@ 2013-11-28  0:22                             ` David Rientjes
  2013-11-28  2:28                               ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-28  0:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > The patch is drawing the line at "the kernel can no longer do anything to 
> > free memory", and that's the line where userspace should be notified or a 
> > process killed by the kernel.
> >
> > Giving current access to memory reserves in the oom killer is an
> > optimization so that all reclaim is exhausted prior to declaring
> > that they are necessary, the kernel still has the ability to allow
> > that process to exit and free memory.
> 
> "they" are necessary?
> 

Memory reserves.

> > This is the same as the oom notifiers within the kernel that free
> > memory from s390 and powerpc archs: the kernel still has the ability
> > to free memory.
> 
> They're not the same at all.  One is the kernel freeing memory, the
> other is a random coincidence.
> 

Current is on the way to freeing memory because it has a pending SIGKILL 
or is already exiting; it simply needs access to memory reserves to do so.  
This was originally introduced to keep the oom killer from having to 
scan the set of eligible processes, silently giving current access to 
memory reserves instead; we didn't want to emit all of the messages to the 
kernel log because scripts (and admins) were looking at the kernel log and 
seeing that the oom killer killed something when the kill really came from 
a different source or the task was already exiting.

We have a differing opinion on what to consider the point of oom (the 
"notification line that has to be drawn").  My position is to notify 
userspace when the kernel has exhausted its capability to free memory 
without killing something.  In the case of current exiting or having a 
pending SIGKILL, memory is going to be freed, the oom killer simply needs 
to preempt the tasklist scan.  The situation is going to be remedied.  I 
defined the notification with this patch to only happen when the kernel 
can't free any memory without a kill so that userspace may do so itself.  
Michal concurred with that position.

So I'll repeat: if you are interested in situations when the limit is 
reached, use memory thresholds; if you are interested in situations where 
reclaim is struggling to free memory, use VMPRESSURE_CRITICAL.
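
For reference, both of those notifications are registered by writing one
line to cgroup.event_control.  The sketch below only builds that line; it
assumes the cgroup v1 format "<event_fd> <control_fd> <arg>", where <arg>
is a byte threshold for memory.usage_in_bytes or a level name such as
"critical" for memory.pressure_level.  The function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the registration line for
 * cgroup.event_control.  `event_fd` is an eventfd, `control_fd` is an
 * open descriptor for the file being watched, and `arg` is the
 * optional third field (threshold in bytes, or a pressure level).
 * Returns the length written, or -1 if `buf` is too small.
 */
int format_event_registration(char *buf, size_t len, int event_fd,
			      int control_fd, const char *arg)
{
	int n;

	assert(buf != NULL && len > 0);

	if (arg != NULL && strlen(arg) > 0)
		n = snprintf(buf, len, "%d %d %s", event_fd, control_fd, arg);
	else
		n = snprintf(buf, len, "%d %d", event_fd, control_fd);

	if (n < 0 || (size_t)n >= len)
		return -1;	/* output error or truncated */
	return n;
}
```

The caller would then write() the resulting string to cgroup.event_control
and block in read() on the eventfd until the event fires.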

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-28  0:22                             ` David Rientjes
@ 2013-11-28  2:28                               ` Johannes Weiner
  2013-11-28  2:52                                 ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-11-28  2:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, Nov 27, 2013 at 04:22:18PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > The patch is drawing the line at "the kernel can no longer do anything to 
> > > free memory", and that's the line where userspace should be notified or a 
> > > process killed by the kernel.
> > >
> > > Giving current access to memory reserves in the oom killer is an
> > > optimization so that all reclaim is exhausted prior to declaring
> > > that they are necessary, the kernel still has the ability to allow
> > > that process to exit and free memory.
> > 
> > "they" are necessary?
> > 
> 
> Memory reserves.
> 
> > > This is the same as the oom notifiers within the kernel that free
> > > memory from s390 and powerpc archs: the kernel still has the ability
> > > to free memory.
> > 
> > They're not the same at all.  One is the kernel freeing memory, the
> > other is a random coincidence.
> > 
> 
> Current is on the way to memory freeing because it has a pending SIGKILL 
> or is already exiting, it simply needs access to memory reserves to do so.  
> This was originally introduced to prevent the oom killer from having to 
> scan the set of eligible processes and silently giving it access to memory 
> reserves; we didn't want to emit all of the messages to the kernel log 
> because scripts (and admins) were looking at the kernel log and seeing 
> that the oom killer killed something when it really came from a different 
> source or was already exiting.
> 
> We have a differing opinion on what to consider the point of oom (the 
> "notification line that has to be drawn").  My position is to notify 
> userspace when the kernel has exhausted its capability to free memory 
> without killing something.  In the case of current exiting or having a 
> pending SIGKILL, memory is going to be freed, the oom killer simply needs 
> to preempt the tasklist scan.  The situation is going to be remedied.  I 
> defined the notification with this patch to only happen when the kernel 
> can't free any memory without a kill so that userspace may do so itself.  
> Michal concurred with that position.

The long-standing, user-visible definition of the current line agrees
with me.  You can't just redefine this, period.

I tried to explain to you how insane the motivation for this patch is,
but it does not look like you are reading what I write.  But you don't
get to change user-visible behavior just like that anyway, much less
so without a sane reason, so this was a complete waste of time :-(

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-28  2:28                               ` Johannes Weiner
@ 2013-11-28  2:52                                 ` David Rientjes
  2013-11-28  3:16                                   ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-11-28  2:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> The long-standing, user-visible definition of the current line agrees
> with me.  You can't just redefine this, period.
> 
> I tried to explain to you how insane the motivation for this patch is,
> but it does not look like you are reading what I write.  But you don't
> get to change user-visible behavior just like that anyway, much less
> so without a sane reason, so this was a complete waste of time :-(
> 

If you would like to leave this to Andrew's decision, that's fine.  
Michal has already agreed with my patch and has acked it in -mm.

If userspace is going to handle oom conditions, which is possible today 
and will be extended in the future, then it should only wake up as a last 
resort when there is no possibility of future memory freeing.  It would be 
stupid to have userspace wake up to handle the oom condition and then 
require it to determine if the kernel simply needed to give it access to 
memory reserves for the allocating task to exit and free memory so it 
doesn't actually need to do anything.

Section 10 of Documentation/cgroups/memory.txt defines the necessary 
actions for processes waiting on this notification to make forward 
progress; it doesn't expect that a process is already going to exit and 
free memory on its own.  Waking up in such a condition would be absolutely 
ludicrous.

Furthermore, if you're looking for notification simply when the memcg oom 
limit has been reached, you can use memory thresholds.  If you're looking 
for notification simply when reclaim is suffering severe pressure, you can 
use VMPRESSURE_CRITICAL.

I've been patient in this thread, but at this point I think everything has 
been said and it's pointless to continue going in circles.  Thanks for 
your time.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-28  2:52                                 ` David Rientjes
@ 2013-11-28  3:16                                   ` Johannes Weiner
  0 siblings, 0 replies; 87+ messages in thread
From: Johannes Weiner @ 2013-11-28  3:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, Nov 27, 2013 at 06:52:10PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > The long-standing, user-visible definition of the current line agrees
> > with me.  You can't just redefine this, period.
> > 
> > I tried to explain to you how insane the motivation for this patch is,
> > but it does not look like you are reading what I write.  But you don't
> > get to change user-visible behavior just like that anyway, much less
> > so without a sane reason, so this was a complete waste of time :-(
> > 
> 
> If you would like to leave this to Andrew's decision, that's fine.  
> Michal has already agreed with my patch and has acked it in -mm.
> 
> If userspace is going to handle oom conditions, which is possible today 
> and will be extended in the future, then it should only wakeup as a last 
> resort when there is no possibility of future memory freeing.

I'll ack a patch that accomplishes that.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-11-27 16:34                       ` Johannes Weiner
  2013-11-27 21:51                         ` David Rientjes
@ 2013-12-02 20:02                         ` Michal Hocko
  2013-12-02 21:25                           ` Johannes Weiner
  1 sibling, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-02 20:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed 27-11-13 11:34:36, Johannes Weiner wrote:
> On Tue, Nov 26, 2013 at 04:53:47PM -0800, David Rientjes wrote:
> > On Fri, 22 Nov 2013, Johannes Weiner wrote:
> > 
> > > But userspace in all likeliness DOES need to take action.
> > > 
> > > Reclaim is a really long process.  If 5 times doing 12 priority cycles
> > > and scanning thousands of pages is not enough to reclaim a single
> > > page, what does that say about the health of the memcg?
> > > 
> > > But more importantly, OOM handling is just inherently racy.  A task
> > > might receive the kill signal a split second *after* userspace was
> > > notified.  Or a task may exit voluntarily a split second after a
> > > victim was chosen and killed.
> > > 
> > 
> > That's not true even today without the userspace oom handling proposal 
> > currently being discussed if you have a memcg oom handler attached to a 
> > parent memcg with access to more memory than an oom child memcg.  The oom 
> > handler can disable the child memcg's oom killer with memory.oom_control 
> > and implement its own policy to deal with any notification of oom.
> 
> I was never implying the kernel handler.  All the races exist with
> userspace handling as well.
> 
> > This patch is required to ensure that in such a scenario that the oom 
> > handler sitting in the parent memcg only wakes up when it's required to 
> > intervene.
> 
> A task could receive an unrelated kill between the OOM notification
> and going to sleep to wait for userspace OOM handling.  Or another
> task could exit voluntarily between the notification and waitqueue
> entry, which would again be short-cut by the oom_recover of the exit
> uncharges.
> 
> oom:                           other tasks:
> check signal/exiting
>                                could exit or get killed here
> mem_cgroup_oom_trylock()
>                                could exit or get killed here
> mem_cgroup_oom_notify()
>                                could exit or get killed here
> if (userspace_handler)
>   sleep()                      could exit or get killed here
> else
>   oom_kill()
>                                could exit or get killed here
> 
> It does not matter where your signal/exiting check is, OOM
> notification can never be race free because OOM is just an arbitrary
> line we draw.  We have no idea what all the tasks are up to and how
> close they are to releasing memory.  Even if we freeze the whole group
> to handle tasks, it does not change the fact that the userspace OOM
> handler might kill one task and after the unfreeze another task
> immediately exits voluntarily or got a kill signal a split second
> after it was frozen.
> 
> You can't fix this.  We just have to draw the line somewhere and
> accept that in rare situations the OOM kill was unnecessary.

But we are not talking just about races here. What if the OOM is a
result of an OOM action itself? E.g. a killed task faults memory in
while exiting and it hasn't freed its memory yet. Should we notify in
such a case? What would a userspace OOM handler do (the in-kernel
implementation has an advantage because it can check the task's flags)?

> So again, I don't see this patch is doing anything but blur the
> current line and make notification less predictable. And, as someone
> else in this thread already said, it's a uservisible change in
> behavior and would break known tuning usecases.

I would like to understand how such a tuning usecase would work and how
it would break with this change.

Consider the above example. You would get 2 notifications for the very
same OOM condition.
On the other hand, if the encountered exiting task was just a race then
we basically have two options. Either there are more tasks racing (and
not all of them are exiting) or there is only one (all are exiting).
We will not lose any notification in the first case because the flags
are checked before mem_cgroup_oom_trylock and so one of the tasks would
lock and notify.
The second case is more interesting. Userspace won't get a notification
but we also know that no action is required as the OOM will be resolved
by itself. And now we should consider whether notification would do more
good than harm. The tuning usecase would lose one event. Would such a
rare situation skew the statistics so much? On the other hand a real OOM
killer would do something, which means something will be killed. I find
the latter much worse.
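
The two cases can be written down as a toy model. This is only a
sketch, assuming the signal/exiting check sits before
mem_cgroup_oom_trylock as in the patch under discussion, so that every
racing task either backs out (it is exiting) or goes on to lock and
notify. oom_notification_sent() and its arguments are hypothetical
names, not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: exiting[i] says whether the i-th task that
 * hit the memcg OOM path has a fatal signal pending or is exiting.
 * A task that is not exiting passes the check, takes the OOM lock and
 * notifies; a task that is exiting backs out without notifying.
 */
static bool oom_notification_sent(const bool *exiting, size_t ntasks)
{
	for (size_t i = 0; i < ntasks; i++)
		if (!exiting[i])
			return true;	/* case 1: somebody locks and notifies */
	return false;			/* case 2: all racing tasks are exiting */
}
```

With one exiting and one ordinary task the model reports a
notification; with only exiting tasks it reports none, which is the
single event the tuning usecase would miss.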

So all in all, I do agree with you that this path will never be race
free and without pointless OOM actions. I also agree that drawing the
line is hard. But I am more inclined to prevent notification when
we already know that _no action_ is required because IMHO the vast
majority of oom listeners are there to _do_ an action which is mostly
deadly.

Finally, if this is too controversial then I would at least like to see
the same check introduced before we go to sleep for the oom_kill_disable
case because that is a real bug.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-02 20:02                         ` Michal Hocko
@ 2013-12-02 21:25                           ` Johannes Weiner
  2013-12-03 12:04                             ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-12-02 21:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote:
> On Wed 27-11-13 11:34:36, Johannes Weiner wrote:
> > On Tue, Nov 26, 2013 at 04:53:47PM -0800, David Rientjes wrote:
> > > On Fri, 22 Nov 2013, Johannes Weiner wrote:
> > > 
> > > > But userspace in all likeliness DOES need to take action.
> > > > 
> > > > Reclaim is a really long process.  If 5 times doing 12 priority cycles
> > > > and scanning thousands of pages is not enough to reclaim a single
> > > > page, what does that say about the health of the memcg?
> > > > 
> > > > But more importantly, OOM handling is just inherently racy.  A task
> > > > might receive the kill signal a split second *after* userspace was
> > > > notified.  Or a task may exit voluntarily a split second after a
> > > > victim was chosen and killed.
> > > > 
> > > 
> > > That's not true even today without the userspace oom handling proposal 
> > > currently being discussed if you have a memcg oom handler attached to a 
> > > parent memcg with access to more memory than an oom child memcg.  The oom 
> > > handler can disable the child memcg's oom killer with memory.oom_control 
> > > and implement its own policy to deal with any notification of oom.
> > 
> > I was never implying the kernel handler.  All the races exist with
> > userspace handling as well.
> > 
> > > This patch is required to ensure that in such a scenario that the oom 
> > > handler sitting in the parent memcg only wakes up when it's required to 
> > > intervene.
> > 
> > A task could receive an unrelated kill between the OOM notification
> > and going to sleep to wait for userspace OOM handling.  Or another
> > task could exit voluntarily between the notification and waitqueue
> > entry, which would again be short-cut by the oom_recover of the exit
> > uncharges.
> > 
> > oom:                           other tasks:
> > check signal/exiting
> >                                could exit or get killed here
> > mem_cgroup_oom_trylock()
> >                                could exit or get killed here
> > mem_cgroup_oom_notify()
> >                                could exit or get killed here
> > if (userspace_handler)
> >   sleep()                      could exit or get killed here
> > else
> >   oom_kill()
> >                                could exit or get killed here
> > 
> > It does not matter where your signal/exiting check is, OOM
> > notification can never be race free because OOM is just an arbitrary
> > line we draw.  We have no idea what all the tasks are up to and how
> > close they are to releasing memory.  Even if we freeze the whole group
> > to handle tasks, it does not change the fact that the userspace OOM
> > handler might kill one task and after the unfreeze another task
> > immediately exits voluntarily or got a kill signal a split second
> > after it was frozen.
> > 
> > You can't fix this.  We just have to draw the line somewhere and
> > accept that in rare situations the OOM kill was unnecessary.
> 
> But we are not talking just about races here. What if the OOM is a
> result of an OOM action itself. E.g. a killed task faults a memory in
> while exiting and it hasn't freed its memory yet. Should we notify in
> such a case? What would an userspace OOM handler do (the in-kernel
> implementation has an advantage because it can check the tasks flags)?

We don't notify in such a case.  Every charge from a TIF_MEMDIE or
exiting task is bypassing the limit immediately.  Not even reclaim.

> > So again, I don't see this patch is doing anything but blur the
> > current line and make notification less predictable. And, as someone
> > else in this thread already said, it's a uservisible change in
> > behavior and would break known tuning usecases.
> 
> I would like to understand how would such a tuning usecase work and how
> it would break with this change.

I would do test runs and with every run increase the size of the
workload until I get OOM notifications to know when the kernel has
been pushed beyond its limits and available memory + reclaim
capability can't keep up with the workload anymore.

Not informing me just because due to timing variance a random process
exits in the last moment would be flat out lying.  The machine is OOM.
Many reclaim cycles failing is a good predictor.  Last minute exit of
random task is not, it's happenstance and I don't want to rely on a
fluke like this to size my workload.
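
A minimal sketch of the kind of listener such a test run could use,
assuming a cgroup v1 memory controller and the eventfd registration
steps from Documentation/cgroups/memory.txt (create an eventfd, open
memory.oom_control, write "<event_fd> <fd of memory.oom_control>" to
cgroup.event_control). The helper names and the cgroup directory
argument are hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the "<event_fd> <oom_control_fd>" line for cgroup.event_control.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_oom_registration(char *buf, size_t len, int efd, int ofd)
{
	int n = snprintf(buf, len, "%d %d", efd, ofd);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register for OOM events of the memcg at cgroup_dir (hypothetical path).
 * Returns an eventfd to wait on, or -1 on failure.
 */
static int register_oom_listener(const char *cgroup_dir)
{
	char path[4096], line[64];
	int efd, ofd, cfd, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	ofd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY);

	n = format_oom_registration(line, sizeof(line), efd, ofd);
	if (ofd < 0 || cfd < 0 || n < 0 || write(cfd, line, n) != n) {
		if (ofd >= 0)
			close(ofd);
		if (cfd >= 0)
			close(cfd);
		close(efd);
		return -1;
	}
	close(cfd);
	/* ofd is deliberately left open for the lifetime of the listener. */
	return efd;
}

/*
 * Block until at least one OOM event fires.  Returns the number of events
 * accumulated since the last read, or -1 on error.
 */
static long long wait_for_oom(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (long long)count;
}
```

A sizing run would call register_oom_listener() once on the memcg
under test, start the workload, and treat the first wakeup from
wait_for_oom() as the point where the workload no longer fits.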

> Consider the above example. You would get 2 notification for the very
> same OOM condition.
> On the other hand if the encountered exiting task was just a race then
> we have two options basically. Either there are more tasks racing (and
> not all of them are exiting) or there is only one (all are exiting).
> We will not loose any notification in the first case because the flags
> are checked before mem_cgroup_oom_trylock and so one of tasks would lock
> and notify.
> The second case is more interesting. Userspace won't get notification
> but we also know that no action is required as the OOM will be resolved
> by itself. And now we should consider whether notification would do more
> good than harm. The tuning usecase would loose one event. Would such a
> rare situation skew the statistics so much? On the other hand a real OOM
> killer would do something which means something will be killed. I find
> the later much worse.

We already check in various places (sigh) for whether reclaim and
killing is still necessary.  What is the end game here?  An endless
loop right before the kill where we check if the kill is still
necessary?

You're not fixing this problem, so why make the notifications less
reliable?

> So all in all. I do agree with you that this path will never be race
> free and without pointless OOM actions. I also agree that drawing the
> line is hard. But I am more inclined to prevent from notification when
> we already know that _no action_ is required because IMHO the vast
> majority of oom listeners are there to _do_ an action which is mostly
> deadly.

If you want to push the machine so hard that active measures like
reclaim can't keep up and you rely on stupid timing like this to save
your sorry butt, then you'll just have to live with the
unpredictability of it.  You're going to eat kills that might have
been avoided last minute either way.  It's no excuse to plaster the MM
with TIF_MEMDIE checks and last-minute cgroup margin checks in the
weirdest locations.

Again, how likely is it anyway that the kill was truly skipped and not
just deferred?  Reclaim failing is a good indicator that you're in
trouble, a random task exiting in an ongoing workload does not say
much.  The machine could still be in trouble, so you just deferred the
inevitable, you didn't really avoid a kill.

At this point we are talking about OOM kill frequency and statistical
probability during apparently normal operations.  The OOM killer was
never written for that, it was supposed to be a last minute resort
that should not occur during normal operations and only if all SANE
measures to avoid it have failed.  99% of all users have no interest
in these micro-optimizations and we shouldn't clutter the code and
have unpredictable behavior without even a trace of data to show that
this is anything more than a placebo measure for one use case.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-02 21:25                           ` Johannes Weiner
@ 2013-12-03 12:04                             ` Michal Hocko
  2013-12-03 20:17                               ` Johannes Weiner
  2013-12-03 23:50                               ` David Rientjes
  0 siblings, 2 replies; 87+ messages in thread
From: Michal Hocko @ 2013-12-03 12:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon 02-12-13 16:25:00, Johannes Weiner wrote:
> On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote:
[...]
> > But we are not talking just about races here. What if the OOM is a
> > result of an OOM action itself. E.g. a killed task faults a memory in
> > while exiting and it hasn't freed its memory yet. Should we notify in
> > such a case? What would an userspace OOM handler do (the in-kernel
> > implementation has an advantage because it can check the tasks flags)?
> 
> We don't notify in such a case.  Every charge from a TIF_MEMDIE or
> exiting task is bypassing the limit immediately.  Not even reclaim.

Not really. Assume a memcg is under OOM. A task is killed by
userspace so we get into signal delivery code which clears
fatal_signal_pending and the code goes on to exit but then it faults in.
__mem_cgroup_try_charge will not see signal pending and TIF_MEMDIE is
not set yet. OOM is still not resolved so we are back to square one.
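
The window can be shown with a toy model. It is a sketch only: struct
toy_task and both helpers are hypothetical stand-ins for the task state
and for the shortcut in the charge path as described above, not the
kernel's own types or code.

```c
#include <stdbool.h>

/* Toy stand-in for the two bits of task state named above. */
struct toy_task {
	bool signal_pending;	/* models fatal_signal_pending(current) */
	bool tif_memdie;	/* models TIF_MEMDIE being set */
};

/* Models the charge-path shortcut: such a task skips the limit. */
static bool charge_bypasses_limit(const struct toy_task *t)
{
	return t->tif_memdie || t->signal_pending;
}

/* Models signal delivery consuming the fatal signal on the way to exit. */
static void deliver_fatal_signal(struct toy_task *t)
{
	t->signal_pending = false;
}
```

Right after the kill the task bypasses the limit. Once the signal has
been delivered and before TIF_MEMDIE is set, charge_bypasses_limit()
returns false, so a fault in that window runs into the still
unresolved OOM.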
 
> > > So again, I don't see this patch is doing anything but blur the
> > > current line and make notification less predictable. And, as someone
> > > else in this thread already said, it's a uservisible change in
> > > behavior and would break known tuning usecases.
> > 
> > I would like to understand how would such a tuning usecase work and how
> > it would break with this change.
> 
> I would do test runs and with every run increase the size of the
> workload until I get OOM notifications to know when the kernel has
> been pushed beyond its limits and available memory + reclaim
> capability can't keep up with the workload anymore.
> 
> Not informing me just because due to timing variance a random process
> exits in the last moment would be flat out lying.  The machine is OOM.
> Many reclaim cycles failing is a good predictor.  Last minute exit of
> random task is not, it's happenstance and I don't want to rely on a
> fluke like this to size my workload.

Such a metric would be inherently racy for the same reason. You simply
cannot rely on not seeing OOMs because an exiting task managed to leave
in time (after MEM_CGROUP_RECLAIM_RETRIES direct reclaim loops and
before mem_cgroup_oom). The difference between in time and a little bit
too late is just too fragile to be useful IMO.

> > Consider the above example. You would get 2 notification for the very
> > same OOM condition.
> > On the other hand if the encountered exiting task was just a race then
> > we have two options basically. Either there are more tasks racing (and
> > not all of them are exiting) or there is only one (all are exiting).
> > We will not loose any notification in the first case because the flags
> > are checked before mem_cgroup_oom_trylock and so one of tasks would lock
> > and notify.
> > The second case is more interesting. Userspace won't get notification
> > but we also know that no action is required as the OOM will be resolved
> > by itself. And now we should consider whether notification would do more
> > good than harm. The tuning usecase would loose one event. Would such a
> > rare situation skew the statistics so much? On the other hand a real OOM
> > killer would do something which means something will be killed. I find
> > the later much worse.
> 
> We already check in various places (sigh) for whether reclaim and
> killing is still necessary.  What is the end game here?  An endless
> loop right before the kill where we check if the kill is still
> necessary?

The patch as is doesn't cover all the cases and ideally we should check
that for OOM_SCAN_ABORT and later in oom_kill_process because they can
back out as well if we want to have only-on-action notification. Such a
solution would be too messy though.

But as I've said, the primary reason I liked this change is that it
solves the above-mentioned OOM during exit issue and it also prevents
a pointless notification. I am perfectly fine with moving the
check+set TIF_MEMDIE down to solve only issue #1 and not mess
with notifications.

> You're not fixing this problem, so why make the notifications less
> reliable?

I am still not seeing why it is less reliable. The notification is
inherently racy so you cannot rely on any simple metrics based on their
count (at least not in general).

> > So all in all. I do agree with you that this path will never be race
> > free and without pointless OOM actions. I also agree that drawing the
> > line is hard. But I am more inclined to prevent from notification when
> > we already know that _no action_ is required because IMHO the vast
> > majority of oom listeners are there to _do_ an action which is mostly
> > deadly.
> 
> If you want to push the machine so hard that active measures like
> reclaim can't keep up and you rely on stupid timing like this to save
> your sorry butt, then you'll just have to live with the
> unpredictability of it.  You're going to eat kills that might have
> been avoided last minute either way.  It's no excuse to plaster the MM
> with TIF_MEMDIE checks and last-minute cgroup margin checks in the
> weirdest locations.

Yes, I do not agree with putting TIF_MEMDIE checks all over the place and
we should reduce their number to a minimum. It is fair to say that the
patch didn't add a new check. It has just moved it to cover both
in-kernel and user space oom paths. That was a bonus I liked. To be
honest I do not see the notification side effect as a big deal as those
are racy anyway and I would rather see fewer of them than more
(especially when it is clear that nothing is to be done).

> Again, how likely is it anyway that the kill was truly skipped and not
> just deferred?  Reclaim failing is a good indicator that you're in
> trouble, a random task exiting in an ongoing workload does not say
> much.  The machine could still be in trouble, so you just deferred the
> inevitable, you didn't really avoid a kill.
> 
> At this point we are talking about OOM kill frequency and statistical
> probability during apparently normal operations.  The OOM killer was
> never written for that, it was supposed to be a last minute resort
> that should not occur during normal operations and only if all SANE
> measures to avoid it have failed.  99% of all users have no interest
> in these micro-optimizations and we shouldn't clutter the code and
> have unpredictable behavior without even a trace of data to show that
> this is anything more than a placebo measure for one use case.

OK, as it seems that the notification part is too controversial, how
would you like the following? It reverts the notification part and still
solves the fault on exit path. I will prepare the full patch with the
changelog if this looks reasonable:
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 28c9221b74ea..f44fe7e65a98 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1783,6 +1783,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
+	/*
+	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * select it.  The goal is to allow it to allocate so that it may
+	 * quickly exit and free its memory.
+	 */
+	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+		set_thread_flag(TIF_MEMDIE);
+		goto cleanup;
+	}
+
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
@@ -2233,16 +2243,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!handle)
 		goto cleanup;
 
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
-		goto cleanup;
-	}
-
 	owait.memcg = memcg;
 	owait.wait.flags = 0;
 	owait.wait.func = memcg_oom_wake_function;
@@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+
+		/* Userspace OOM handler cannot set TIF_MEMDIE to a target */
+		if (memcg->oom_kill_disable) {
+			if ((fatal_signal_pending(current) ||
+						current->flags & PF_EXITING))
+				set_thread_flag(TIF_MEMDIE);
+		}
 	}
 
 	if (locked) {

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-03 12:04                             ` Michal Hocko
@ 2013-12-03 20:17                               ` Johannes Weiner
  2013-12-03 21:00                                 ` Michal Hocko
  2013-12-03 23:50                               ` David Rientjes
  1 sibling, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-12-03 20:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue, Dec 03, 2013 at 01:04:54PM +0100, Michal Hocko wrote:
> On Mon 02-12-13 16:25:00, Johannes Weiner wrote:
> > On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote:
> [...]
> > > But we are not talking just about races here. What if the OOM is a
> > > result of an OOM action itself. E.g. a killed task faults a memory in
> > > while exiting and it hasn't freed its memory yet. Should we notify in
> > > such a case? What would an userspace OOM handler do (the in-kernel
> > > implementation has an advantage because it can check the tasks flags)?
> > 
> > We don't notify in such a case.  Every charge from a TIF_MEMDIE or
> > exiting task is bypassing the limit immediately.  Not even reclaim.
> 
> Not really. Assume a memcg is under OOM. A task is killed by
> userspace so we get into signal delivery code which clears
> fatal_signal_pending and the code goes on to exit but then it faults in.
> __mem_cgroup_try_charge will not see signal pending and TIF_MEMDIE is
> not set yet. OOM is still not resolved so we are back to square one.

Ah, that's a completely separate problem, though.  One issue I have
with these checks is that I never know which ones are shortcuts and
micro optimizations and which ones are functionally necessary.

> > > > So again, I don't see this patch is doing anything but blur the
> > > > current line and make notification less predictable. And, as someone
> > > > else in this thread already said, it's a uservisible change in
> > > > behavior and would break known tuning usecases.
> > > 
> > > I would like to understand how would such a tuning usecase work and how
> > > it would break with this change.
> > 
> > I would do test runs and with every run increase the size of the
> > workload until I get OOM notifications to know when the kernel has
> > been pushed beyond its limits and available memory + reclaim
> > capability can't keep up with the workload anymore.
> > 
> > Not informing me just because due to timing variance a random process
> > exits in the last moment would be flat out lying.  The machine is OOM.
> > Many reclaim cycles failing is a good predictor.  Last minute exit of
> > random task is not, it's happenstance and I don't want to rely on a
> > fluke like this to size my workload.
> 
> Such a metric would be inherently racy for the same reason. You simply
> cannot rely on not seeing OOMs because an exiting task managed to leave
> in time (after MEM_CGROUP_RECLAIM_RETRIES direct reclaim loops and
> before mem_cgroup_oom). Difference between in time and little bit too
> late is just too fragile to be useful IMO.

Are we saying the same thing?  Or did I misunderstand you?

> > > Consider the above example. You would get 2 notification for the very
> > > same OOM condition.
> > > On the other hand if the encountered exiting task was just a race then
> > > we have two options basically. Either there are more tasks racing (and
> > > not all of them are exiting) or there is only one (all are exiting).
> > > We will not loose any notification in the first case because the flags
> > > are checked before mem_cgroup_oom_trylock and so one of tasks would lock
> > > and notify.
> > > The second case is more interesting. Userspace won't get notification
> > > but we also know that no action is required as the OOM will be resolved
> > > by itself. And now we should consider whether notification would do more
> > > good than harm. The tuning usecase would loose one event. Would such a
> > > rare situation skew the statistics so much? On the other hand a real OOM
> > > killer would do something which means something will be killed. I find
> > > the later much worse.
> > 
> > We already check in various places (sigh) for whether reclaim and
> > killing is still necessary.  What is the end game here?  An endless
> > loop right before the kill where we check if the kill is still
> > necessary?
> 
> The patch as is doesn't cover all the cases and ideally we should check
> that for OOM_SCAN_ABORT and later in oom_kill_process because they can
> back out as well if we want to have only-on-action notification. Such a
> solution would be too messy though.

There is never an only-on-action notification and there is no check to
cover all cases.  This is the fallacy I'm trying to point out.  All we
can do is pick a fairly predictable line and stick to it.

> But as I've said. The primary reason I liked this change is because it
> solves the above mentioned OOM during exit issue and it also prevents
> from a pointless notification. I am perfectly fine with moving the
> check+set TIF_MEMDIE down so solve only the issue #1 and do not mess
> with notifications.

The notification is not pointless, we are OOM at the time we make the
decision.

> > You're not fixing this problem, so why make the notifications less
> > reliable?
> 
> I am still not seeing why it is less reliable. The notification is
> inherently racy so you cannot rely on any simple metrics based on their
> count (at least not in general).

It should be based on reclaim failing, which is more predictable and
reproducible across multiple runs.  Anything else is just random
coincidence, I can't fathom why you think that it's at all meaningful.

You could do one more reclaim cycle and a charge attempt, or even just
add an msleep() between the last reclaim cycle and the last charge
attempt for exactly the same outcome.

> > > So all in all. I do agree with you that this path will never be race
> > > free and without pointless OOM actions. I also agree that drawing the
> > > line is hard. But I am more inclined to prevent from notification when
> > > we already know that _no action_ is required because IMHO the vast
> > > majority of oom listeners are there to _do_ an action which is mostly
> > > deadly.
> > 
> > If you want to push the machine so hard that active measures like
> > reclaim can't keep up and you rely on stupid timing like this to save
> > your sorry butt, then you'll just have to live with the
> > unpredictability of it.  You're going to eat kills that might have
> > been avoided last minute either way.  It's no excuse to plaster the MM
> > with TIF_MEMDIE checks and last-minute cgroup margin checks in the
> > weirdest locations.
> 
> Yes I do not agree with putting TIF_MEMDIE checks all over the place and
> we should reduce their number to minimum. It is fair to say that the
> patch didn't add a new check. It just has moved it to cover both
> in-kernel and user space oom paths. That was a bonus I liked. To be
> honest I do not see the notification side effect as a big deal as those
> are racy anyway and I would rather see fewer of them than more
> (especially when it is clear that nothing is to be done).

I don't know why we have this check there in the first place, it
should be part of the OOM victim selection process and not a hacky
shortcut in memcg.  It's a blatant layering violation for no obvious
reason.

> > Again, how likely is it anyway that the kill was truly skipped and not
> > just deferred?  Reclaim failing is a good indicator that you're in
> > trouble, a random task exiting in an ongoing workload does not say
> > much.  The machine could still be in trouble, so you just deferred the
> > inevitable, you didn't really avoid a kill.
> > 
> > At this point we are talking about OOM kill frequency and statistical
> > probability during apparently normal operations.  The OOM killer was
> > never written for that, it was supposed to be a last minute resort
> > that should not occur during normal operations and only if all SANE
> > measures to avoid it have failed.  99% of all users have no interest
> > in these micro-optimizations and we shouldn't clutter the code and
> > have unpredictable behavior without even a trace of data to show that
> > this is anything more than a placebo measure for one use case.
> 
> OK, as it seems that the notification part is too controversial, how
> would you like the following? It reverts the notification part and still
> solves the fault on exit path. I will prepare the full patch with the
> changelog if this looks reasonable:
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 28c9221b74ea..f44fe7e65a98 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1783,6 +1783,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned int points = 0;
>  	struct task_struct *chosen = NULL;
>  
> +	/*
> +	 * If current has a pending SIGKILL or is exiting, then automatically
> +	 * select it.  The goal is to allow it to allocate so that it may
> +	 * quickly exit and free its memory.
> +	 */
> +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +		set_thread_flag(TIF_MEMDIE);
> +		goto cleanup;

		return;

Anyway, I wish this were not here at all but part of the thread
scanner instead.  As David said, this is a very cold path, why the shortcut?

> @@ -2233,16 +2243,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!handle)
>  		goto cleanup;
>  
> -	/*
> -	 * If current has a pending SIGKILL or is exiting, then automatically
> -	 * select it.  The goal is to allow it to allocate so that it may
> -	 * quickly exit and free its memory.
> -	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> -		set_thread_flag(TIF_MEMDIE);
> -		goto cleanup;
> -	}
> -
>  	owait.memcg = memcg;
>  	owait.wait.flags = 0;
>  	owait.wait.func = memcg_oom_wake_function;
> @@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> +
> +		/* Userspace OOM handler cannot set TIF_MEMDIE to a target */
> +		if (memcg->oom_kill_disable) {
> +			if ((fatal_signal_pending(current) ||
> +						current->flags & PF_EXITING))
> +				set_thread_flag(TIF_MEMDIE);
> +		}

This is an entirely different change that I think makes sense.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-03 20:17                               ` Johannes Weiner
@ 2013-12-03 21:00                                 ` Michal Hocko
  2013-12-03 21:23                                   ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-03 21:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue 03-12-13 15:17:13, Johannes Weiner wrote:
> On Tue, Dec 03, 2013 at 01:04:54PM +0100, Michal Hocko wrote:
> > On Mon 02-12-13 16:25:00, Johannes Weiner wrote:
> > > On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote:
> > [...]
> > > > But we are not talking just about races here. What if the OOM is a
> > > > result of an OOM action itself. E.g. a killed task faults a memory in
> > > > while exiting and it hasn't freed its memory yet. Should we notify in
> > > > such a case? What would an userspace OOM handler do (the in-kernel
> > > > implementation has an advantage because it can check the tasks flags)?
> > > 
> > > We don't notify in such a case.  Every charge from a TIF_MEMDIE or
> > > exiting task is bypassing the limit immediately.  Not even reclaim.
> > 
> > Not really. Assume a memcg is under OOM. A task is killed by
> > userspace so we get into signal delivery code which clears
> > fatal_signal_pending and the code goes on to exit but then it faults in.
> > __mem_cgroup_try_charge will not see signal pending and TIF_MEMDIE is
> > not set yet. OOM is still not resolved so we are back to square one.
> 
> Ah, that's a completely separate problem, though.  One issue I have
> with these checks is that I never know which one are shortcuts and
> micro optimizations and which one are functionally necessary.

That is why I asked for the changelog update to mention the fix as well
(https://lkml.org/lkml/2013/11/18/161 resp.
https://lkml.org/lkml/2013/11/18/142)

> > > > > So again, I don't see this patch is doing anything but blur the
> > > > > current line and make notification less predictable. And, as someone
> > > > > else in this thread already said, it's a uservisible change in
> > > > > behavior and would break known tuning usecases.
> > > > 
> > > > I would like to understand how would such a tuning usecase work and how
> > > > it would break with this change.
> > > 
> > > I would do test runs and with every run increase the size of the
> > > workload until I get OOM notifications to know when the kernel has
> > > been pushed beyond its limits and available memory + reclaim
> > > capability can't keep up with the workload anymore.
> > > 
> > > Not informing me just because due to timing variance a random process
> > > exits in the last moment would be flat out lying.  The machine is OOM.
> > > Many reclaim cycles failing is a good predictor.  Last minute exit of
> > > random task is not, it's happenstance and I don't want to rely on a
> > > fluke like this to size my workload.
> > 
> > Such a metric would be inherently racy for the same reason. You simply
> > cannot rely on not seeing OOMs because an exiting task managed to leave
> > in time (after MEM_CGROUP_RECLAIM_RETRIES direct reclaim loops and
> > before mem_cgroup_oom). Difference between in time and little bit too
> > late is just too fragile to be useful IMO.
> 
> Are we saying the same thing?  Or did I misunderstand you?

Yes, we are, and my point was that using such a metric doesn't make much
sense. So arguing about reliability and possible regressions is a bit of
an overreaction. I was backing this patch because it moved the check to
a more appropriate place, where it also solves another issue.

I was probably not clear enough about that.

> > > > Consider the above example. You would get 2 notifications for the very
> > > > same OOM condition.
> > > > On the other hand if the encountered exiting task was just a race then
> > > > we have two options basically. Either there are more tasks racing (and
> > > > not all of them are exiting) or there is only one (all are exiting).
> > > > We will not lose any notification in the first case because the flags
> > > > are checked before mem_cgroup_oom_trylock and so one of the tasks would
> > > > lock and notify.
> > > > The second case is more interesting. Userspace won't get a notification
> > > > but we also know that no action is required as the OOM will be resolved
> > > > by itself. And now we should consider whether notification would do more
> > > > good than harm. The tuning usecase would lose one event. Would such a
> > > > rare situation skew the statistics so much? On the other hand a real OOM
> > > > killer would do something which means something will be killed. I find
> > > > the latter much worse.
> > > 
> > > We already check in various places (sigh) for whether reclaim and
> > > killing is still necessary.  What is the end game here?  An endless
> > > loop right before the kill where we check if the kill is still
> > > necessary?
> > 
> > The patch as is doesn't cover all the cases and ideally we should check
> > that for OOM_SCAN_ABORT and later in oom_kill_process because they can
> > back out as well if we want to have only-on-action notification. Such a
> > solution would be too messy though.
> 
> There is never an only-on-action notification and there is no check to
> cover all cases.  This is the fallacy I'm trying to point out.  All we
> can do is pick a fairly predictable line and stick to it.

Yes, nothing will be 100%, but drawing the line at the place where the
kernel is going to _do_ something sounds like quite a clear cut to
me and a reasonable semantic. That something might be either killing
something or putting somebody to sleep to wait for userspace. That
would be an ideal semantic IMHO. Unfortunately, the code doesn't allow
us to do so now (not without a lot of refactoring).

> > But as I've said. The primary reason I liked this change is because it
> > solves the above mentioned OOM during exit issue and it also prevents
> > from a pointless notification. I am perfectly fine with moving the
> > check+set TIF_MEMDIE down so solve only the issue #1 and do not mess
> > with notifications.
> 
> The notification is not pointless, we are OOM at the time we make the
> decision.

We clearly have different opinions here. I do not consider temporary
OOM conditions significant enough. As OOM itself is really hard to
define and to _agree_ on, I think we should be as practical as
possible and make the oom notification interface as comfortable
for users as possible. And then we should balance the pros and cons of
notifying. I am still convinced that notifying less is better in general
because I see OOM killer use cases.

> > > You're not fixing this problem, so why make the notifications less
> > > reliable?
> > 
> > I am still not seeing why it is less reliable. The notification is
> > inherently racy so you cannot rely on any simple metrics based on their
> > count (at least not in general).
> 
> It should be based on reclaim failing, which is more predictable and
> reproducible across multiple runs.  Anything else is just random
> coincidence, I can't fathom why you think that it's at all meaningful.

If you are interested in reclaim failing then failcnt is a much better
metric for you. That's all that I am saying.
 
> You could do one more reclaim cycle and a charge attempt, or even just
> add an msleep() between the last reclaim cycle and the last charge
> attempt for exactly the same outcome.
> 
> > > > So all in all. I do agree with you that this path will never be race
> > > > free and without pointless OOM actions. I also agree that drawing the
> > > > line is hard. But I am more inclined to prevent from notification when
> > > > we already know that _no action_ is required because IMHO the vast
> > > > majority of oom listeners are there to _do_ an action which is mostly
> > > > deadly.
> > > 
> > > If you want to push the machine so hard that active measures like
> > > reclaim can't keep up and you rely on stupid timing like this to save
> > > your sorry butt, then you'll just have to live with the
> > > unpredictability of it.  You're going to eat kills that might have
> > > been avoided last minute either way.  It's no excuse to plaster the MM
> > > with TIF_MEMDIE checks and last-minute cgroup margin checks in the
> > > weirdest locations.
> > 
> > Yes I do not agree with putting TIF_MEMDIE checks all over the place and
> > we should reduce their number to minimum. It is fair to say that the
> > patch didn't add a new check. It just has moved it to cover both
> > in-kernel and user space oom paths. That was a bonus I liked. To be
> > honest I do not see the notification side effect as a big deal as those
> > are racy anyway and I would rather see fewer of them than more
> > (especially when it is clear that nothing is to be done).
> 
> I don't know why we have this check there in the first place, it
> should be part of the OOM victim selection process and not a hacky
> shortcut in memcg.  It's a blatant layering violation for no obvious
> reason.

My understanding is that it tries to expedite the process. Scanning
tasks might be quite expensive. Sure, you would find a task and abort
scanning as well but that might be after zillions of tasks scanned.
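
To illustrate the cost argument: a minimal userspace sketch, not kernel
code, of why testing the current task first is cheaper than finding it
through a full scan. The toy_* names and flag values are made up for
illustration and only stand in for PF_EXITING and fatal_signal_pending().

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_EXITING	0x1	/* stands in for PF_EXITING */
#define TOY_SIGKILL	0x2	/* stands in for fatal_signal_pending() */

/* Toy task: just a flags word, not the kernel's struct task_struct. */
struct toy_task {
	unsigned int flags;
};

/* True if the task is already on its way out and only needs reserves. */
bool toy_task_dying(const struct toy_task *t)
{
	return (t->flags & (TOY_EXITING | TOY_SIGKILL)) != 0;
}

/*
 * Find a dying task.  With use_shortcut, the task at index cur is tested
 * first; otherwise (or if that test fails) the whole array is scanned.
 * *visited reports how many tasks were examined.
 */
const struct toy_task *toy_find_dying(const struct toy_task *tasks, size_t n,
				      size_t cur, bool use_shortcut,
				      size_t *visited)
{
	*visited = 0;

	if (use_shortcut) {
		(*visited)++;
		if (toy_task_dying(&tasks[cur]))
			return &tasks[cur];
	}

	for (size_t i = 0; i < n; i++) {
		(*visited)++;
		if (toy_task_dying(&tasks[i]))
			return &tasks[i];
	}
	return NULL;
}
```

Both variants find the same task; the only difference is how many tasks
are visited on the way, which is the trade-off being questioned here for
such a cold path.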

> > > Again, how likely is it anyway that the kill was truly skipped and not
> > > just deferred?  Reclaim failing is a good indicator that you're in
> > > trouble, a random task exiting in an ongoing workload does not say
> > > much.  The machine could still be in trouble, so you just deferred the
> > > inevitable, you didn't really avoid a kill.
> > > 
> > > At this point we are talking about OOM kill frequency and statistical
> > > probability during apparently normal operations.  The OOM killer was
> > > never written for that, it was supposed to be a last minute resort
> > > that should not occur during normal operations and only if all SANE
> > > measures to avoid it have failed.  99% of all users have no interest
> > > in these micro-optimizations and we shouldn't clutter the code and
> > > have unpredictable behavior without even a trace of data to show that
> > > this is anything more than a placebo measure for one use case.
> > 
> > OK, as it seems that the notification part is too controversial, how
> > would you like the following? It reverts the notification part and still
> > solves the fault on exit path. I will prepare the full patch with the
> > changelog if this looks reasonable:
> > ---
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 28c9221b74ea..f44fe7e65a98 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1783,6 +1783,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	unsigned int points = 0;
> >  	struct task_struct *chosen = NULL;
> >  
> > +	/*
> > +	 * If current has a pending SIGKILL or is exiting, then automatically
> > +	 * select it.  The goal is to allow it to allocate so that it may
> > +	 * quickly exit and free its memory.
> > +	 */
> > +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > +		set_thread_flag(TIF_MEMDIE);
> > +		goto cleanup;
> 
> 		return;
> 
> Anyway, I wish this would not be here at all and part of the thread
> scanner.  As David said, this is a very cold path, why the shortcut?

Right. I would rather drop the original patch, since we couldn't reach
an agreement; then the partial revert wouldn't be needed. This was just
an illustration.

> > @@ -2233,16 +2243,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
> >  	if (!handle)
> >  		goto cleanup;
> >  
> > -	/*
> > -	 * If current has a pending SIGKILL or is exiting, then automatically
> > -	 * select it.  The goal is to allow it to allocate so that it may
> > -	 * quickly exit and free its memory.
> > -	 */
> > -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > -		set_thread_flag(TIF_MEMDIE);
> > -		goto cleanup;
> > -	}
> > -
> >  	owait.memcg = memcg;
> >  	owait.wait.flags = 0;
> >  	owait.wait.func = memcg_oom_wake_function;
> > @@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
> >  		schedule();
> >  		mem_cgroup_unmark_under_oom(memcg);
> >  		finish_wait(&memcg_oom_waitq, &owait.wait);
> > +
> > +		/* Userspace OOM handler cannot set TIF_MEMDIE to a target */
> > +		if (memcg->oom_kill_disable) {
> > +			if ((fatal_signal_pending(current) ||
> > +						current->flags & PF_EXITING))
> > +				set_thread_flag(TIF_MEMDIE);
> > +		}
> 
> This is an entirely different change that I think makes sense.

OK, I will post the full patch after we settle with this one finally.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-03 21:00                                 ` Michal Hocko
@ 2013-12-03 21:23                                   ` Johannes Weiner
  0 siblings, 0 replies; 87+ messages in thread
From: Johannes Weiner @ 2013-12-03 21:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue, Dec 03, 2013 at 10:00:09PM +0100, Michal Hocko wrote:
> On Tue 03-12-13 15:17:13, Johannes Weiner wrote:
> > On Tue, Dec 03, 2013 at 01:04:54PM +0100, Michal Hocko wrote:
> > > On Mon 02-12-13 16:25:00, Johannes Weiner wrote:
> > > > On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > But we are not talking just about races here. What if the OOM is a
> > > > > result of an OOM action itself. E.g. a killed task faults a memory in
> > > > > while exiting and it hasn't freed its memory yet. Should we notify in
> > > > > such a case? What would an userspace OOM handler do (the in-kernel
> > > > > implementation has an advantage because it can check the tasks flags)?
> > > > 
> > > > We don't notify in such a case.  Every charge from a TIF_MEMDIE or
> > > > exiting task is bypassing the limit immediately.  Not even reclaim.
> > > 
> > > Not really. Assume a memcg is under OOM. A task is killed by
> > > userspace so we get into signal delivery code which clears
> > > fatal_signal_pending and the code goes on to exit but then it faults in.
> > > __mem_cgroup_try_charge will not see signal pending and TIF_MEMDIE is
> > > not set yet. OOM is still not resolved so we are back to square one.
> > 
> > Ah, that's a completely separate problem, though.  One issue I have
> > with these checks is that I never know which one are shortcuts and
> > micro optimizations and which one are functionally necessary.
> 
> That is why I asked for the changelog update to mention the fix as well
> (https://lkml.org/lkml/2013/11/18/161 resp.
> https://lkml.org/lkml/2013/11/18/142)
> 
> > > > > > So again, I don't see this patch is doing anything but blur the
> > > > > > current line and make notification less predictable. And, as someone
> > > > > > else in this thread already said, it's a uservisible change in
> > > > > > behavior and would break known tuning usecases.
> > > > > 
> > > > > I would like to understand how would such a tuning usecase work and how
> > > > > it would break with this change.
> > > > 
> > > > I would do test runs and with every run increase the size of the
> > > > workload until I get OOM notifications to know when the kernel has
> > > > been pushed beyond its limits and available memory + reclaim
> > > > capability can't keep up with the workload anymore.
> > > > 
> > > > Not informing me just because due to timing variance a random process
> > > > exits in the last moment would be flat out lying.  The machine is OOM.
> > > > Many reclaim cycles failing is a good predictor.  Last minute exit of
> > > > random task is not, it's happenstance and I don't want to rely on a
> > > > fluke like this to size my workload.
> > > 
> > > Such a metric would be inherently racy for the same reason. You simply
> > > cannot rely on not seeing OOMs because an exiting task managed to leave
> > > in time (after MEM_CGROUP_RECLAIM_RETRIES direct reclaim loops and
> > > before mem_cgroup_oom). Difference between in time and little bit too
> > > late is just too fragile to be useful IMO.
> > 
> > Are we saying the same thing?  Or did I misunderstand you?
> 
> Yes, we are, and my point was that using such a metric doesn't make much
> sense. So arguing about reliability and possible regressions is a bit of
> an overreaction. I was backing this patch because it moved the check to
> a more appropriate place, where it also solves another issue.
> 
> I was probably not clear enough about that.
> 
> > > > > Consider the above example. You would get 2 notifications for the very
> > > > > same OOM condition.
> > > > > On the other hand if the encountered exiting task was just a race then
> > > > > we have two options basically. Either there are more tasks racing (and
> > > > > not all of them are exiting) or there is only one (all are exiting).
> > > > > We will not lose any notification in the first case because the flags
> > > > > are checked before mem_cgroup_oom_trylock and so one of the tasks would
> > > > > lock and notify.
> > > > > The second case is more interesting. Userspace won't get a notification
> > > > > but we also know that no action is required as the OOM will be resolved
> > > > > by itself. And now we should consider whether notification would do more
> > > > > good than harm. The tuning usecase would lose one event. Would such a
> > > > > rare situation skew the statistics so much? On the other hand a real OOM
> > > > > killer would do something which means something will be killed. I find
> > > > > the latter much worse.
> > > > 
> > > > We already check in various places (sigh) for whether reclaim and
> > > > killing is still necessary.  What is the end game here?  An endless
> > > > loop right before the kill where we check if the kill is still
> > > > necessary?
> > > 
> > > The patch as is doesn't cover all the cases and ideally we should check
> > > that for OOM_SCAN_ABORT and later in oom_kill_process because they can
> > > back out as well if we want to have only-on-action notification. Such a
> > > solution would be too messy though.
> > 
> > There is never an only-on-action notification and there is no check to
> > cover all cases.  This is the fallacy I'm trying to point out.  All we
> > can do is pick a fairly predictable line and stick to it.
> 
> Yes, nothing will be 100%, but drawing the line at the place where the
> kernel is going to _do_ something sounds like quite a clear cut to
> me and a reasonable semantic. That something might be either killing
> something or putting somebody to sleep to wait for userspace. That
> would be an ideal semantic IMHO. Unfortunately, the code doesn't allow
> us to do so now (not without a lot of refactoring).
> 
> > > But as I've said. The primary reason I liked this change is because it
> > > solves the above mentioned OOM during exit issue and it also prevents
> > > from a pointless notification. I am perfectly fine with moving the
> > > check+set TIF_MEMDIE down so solve only the issue #1 and do not mess
> > > with notifications.
> > 
> > The notification is not pointless, we are OOM at the time we make the
> > decision.
> 
> We clearly have different opinions here. I do not consider temporary
> OOM conditions significant enough.

As opposed to what, "permanent" OOM conditions? :-) I'm running out of
ways to phrase this...  The sampling window for OOM conditions is
completely arbitrary.  Right now it spans 5 reclaim cycles.  Adding a
last-second check for a random event right before the kill adds
nothing but noise.

> As OOM itself is really hard to define and to _agree_ on, I think
> we should be as practical as possible and make the oom notification
> interface as comfortable for users as possible. And then we should
> balance the pros and cons of notifying. I am still convinced that
> notifying less is better in general because I see OOM killer use
> cases.

If the sampling window is too small for you, then increase the number
of reclaim cycles as I proposed earlier.  It does exactly the same
thing but it's less invasive:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13b9d0f..6d308ed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -67,7 +67,7 @@
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
 EXPORT_SYMBOL(mem_cgroup_subsys);
 
-#define MEM_CGROUP_RECLAIM_RETRIES	5
+#define MEM_CGROUP_RECLAIM_RETRIES	6
 static struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 #ifdef CONFIG_MEMCG_SWAP

Or just wait for every single task in the memcg to be stuck in the
charge path, then you know for sure that the OOM condition is
permanent.  Unless an external task kills one of them, of course...
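
A toy model of that sampling window, as a hedged illustration only: the
charge fails, and the group counts as OOM, once a fixed number of reclaim
cycles have not made room. toy_memcg, toy_try_charge() and late_reclaim()
are hypothetical names, and the real charge path in mm/memcontrol.c is
considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy memcg: usage and limit in pages.  Not the kernel's struct mem_cgroup. */
struct toy_memcg {
	size_t usage;
	size_t limit;
};

/* Hypothetical reclaim hook: returns the number of pages it freed. */
typedef size_t (*reclaim_fn)(struct toy_memcg *cg, int cycle);

/*
 * Try to charge nr pages.  Returns true on success, false if the group is
 * declared OOM because `retries` reclaim cycles failed to make room.
 */
bool toy_try_charge(struct toy_memcg *cg, size_t nr, int retries,
		    reclaim_fn reclaim)
{
	for (int cycle = 0; ; cycle++) {
		if (cg->usage + nr <= cg->limit) {
			cg->usage += nr;
			return true;
		}
		if (cycle == retries)
			return false;	/* sampling window exhausted: OOM */

		size_t freed = reclaim(cg, cycle);

		/* Never let the toy usage counter underflow. */
		cg->usage -= freed < cg->usage ? freed : cg->usage;
	}
}

/* Example hook: frees a single page, but only on the sixth cycle. */
size_t late_reclaim(struct toy_memcg *cg, int cycle)
{
	(void)cg;
	return cycle == 5 ? 1 : 0;
}
```

With late_reclaim(), a full group hits OOM at 5 retries but charges
successfully at 6. Where the window ends is arbitrary, and moving it by
one cycle changes the outcome in the same way a last-second exit check
would.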

> > > @@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
> > >  		schedule();
> > >  		mem_cgroup_unmark_under_oom(memcg);
> > >  		finish_wait(&memcg_oom_waitq, &owait.wait);
> > > +
> > > +		/* Userspace OOM handler cannot set TIF_MEMDIE to a target */
> > > +		if (memcg->oom_kill_disable) {
> > > +			if ((fatal_signal_pending(current) ||
> > > +						current->flags & PF_EXITING))
> > > +				set_thread_flag(TIF_MEMDIE);
> > > +		}
> > 
> > This is an entirely different change that I think makes sense.
> 
> OK, I will post the full patch after we settle with this one finally.

Thanks.

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-03 12:04                             ` Michal Hocko
  2013-12-03 20:17                               ` Johannes Weiner
@ 2013-12-03 23:50                               ` David Rientjes
  2013-12-04  3:34                                 ` Johannes Weiner
  2013-12-04 11:13                                 ` Michal Hocko
  1 sibling, 2 replies; 87+ messages in thread
From: David Rientjes @ 2013-12-03 23:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue, 3 Dec 2013, Michal Hocko wrote:

> OK, as it seems that the notification part is too controversial, how
> would you like the following? It reverts the notification part and still
> solves the fault on exit path. I will prepare the full patch with the
> changelog if this looks reasonable:

Um, no, that's not satisfactory because it obviously does the check after 
mem_cgroup_oom_notify().  There is absolutely no reason why userspace 
should be woken up when current simply needs access to memory reserves to 
exit.  You can already get such notification by memory thresholds at the 
memcg limit.

I'll repeat: Section 10 of Documentation/cgroups/memory.txt specifies what 
userspace should do when waking up; one of those options is not "check if 
the memcg is still actually oom in a short period of time once a charging 
task with a pending SIGKILL or in the exit path has been able to exit."  
Users of this interface typically also disable the memcg oom killer 
through the same file, it's ludicrous to put the responsibility on 
userspace to determine if the wakeup is actionable and requires it to 
intervene in one of the methods listed in section 10.
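
For reference, the userspace side of this interface looks roughly like
the sketch below, which follows the registration steps listed in
Documentation/cgroups/memory.txt (create an eventfd, open
memory.oom_control, write both descriptors to cgroup.event_control).
The helper names and the cgroup directory passed in are assumptions made
for illustration, and error handling is minimal.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the "<event_fd> <oom_control_fd>" string that is written to
 * cgroup.event_control.  Returns its length, or -1 if buf is too small.
 */
int format_oom_registration(char *buf, size_t len, int efd, int ofd)
{
	int n = snprintf(buf, len, "%d %d", efd, ofd);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register an OOM notifier for the memcg at cgroup_dir (a caller-supplied,
 * hypothetical path).  Returns an eventfd to read() from, or -1 on error.
 * The memory.oom_control descriptor is deliberately left open while
 * listening.
 */
int register_oom_notifier(const char *cgroup_dir)
{
	char path[4096], line[64];
	int efd, ofd, cfd, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	ofd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY);

	n = format_oom_registration(line, sizeof(line), efd, ofd);
	if (ofd < 0 || cfd < 0 || n < 0 || write(cfd, line, n) != n) {
		if (ofd >= 0)
			close(ofd);
		if (cfd >= 0)
			close(cfd);
		close(efd);
		return -1;
	}
	close(cfd);
	return efd;
}

/* Block until the kernel signals an OOM event; returns 0 on wakeup. */
int wait_for_oom(int efd)
{
	uint64_t count;

	return read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count) ? 0 : -1;
}
```

A handler built on this would call register_oom_notifier() once and then
loop on wait_for_oom(). Nothing in the wakeup itself says whether the
charging task merely needed reserves to exit, which is the case at issue
here.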

> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 28c9221b74ea..f44fe7e65a98 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1783,6 +1783,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned int points = 0;
>  	struct task_struct *chosen = NULL;
>  
> +	/*
> +	 * If current has a pending SIGKILL or is exiting, then automatically
> +	 * select it.  The goal is to allow it to allocate so that it may
> +	 * quickly exit and free its memory.
> +	 */
> +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +		set_thread_flag(TIF_MEMDIE);
> +		goto cleanup;
> +	}
> +
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
>  	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
>  	for_each_mem_cgroup_tree(iter, memcg) {
> @@ -2233,16 +2243,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!handle)
>  		goto cleanup;
>  
> -	/*
> -	 * If current has a pending SIGKILL or is exiting, then automatically
> -	 * select it.  The goal is to allow it to allocate so that it may
> -	 * quickly exit and free its memory.
> -	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> -		set_thread_flag(TIF_MEMDIE);
> -		goto cleanup;
> -	}
> -
>  	owait.memcg = memcg;
>  	owait.wait.flags = 0;
>  	owait.wait.func = memcg_oom_wake_function;
> @@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> +
> +		/* Userspace OOM handler cannot set TIF_MEMDIE to a target */
> +		if (memcg->oom_kill_disable) {
> +			if ((fatal_signal_pending(current) ||
> +						current->flags & PF_EXITING))
> +				set_thread_flag(TIF_MEMDIE);
> +		}
>  	}
>  
>  	if (locked) {

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-03 23:50                               ` David Rientjes
@ 2013-12-04  3:34                                 ` Johannes Weiner
  2013-12-04 11:13                                 ` Michal Hocko
  1 sibling, 0 replies; 87+ messages in thread
From: Johannes Weiner @ 2013-12-04  3:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue, Dec 03, 2013 at 03:50:41PM -0800, David Rientjes wrote:
> On Tue, 3 Dec 2013, Michal Hocko wrote:
> 
> > OK, as it seems that the notification part is too controversial, how
> > would you like the following? It reverts the notification part and still
> > solves the fault on exit path. I will prepare the full patch with the
> > changelog if this looks reasonable:
> 
> Um, no, that's not satisfactory because it obviously does the check after 
> mem_cgroup_oom_notify().  There is absolutely no reason why userspace 
> should be woken up when current simply needs access to memory reserves to 
> exit.  You can already get such notification by memory thresholds at the 
> memcg limit.
> 
> I'll repeat: Section 10 of Documentation/cgroups/memory.txt specifies what 
> userspace should do when waking up; one of those options is not "check if 
> the memcg is still actually oom in a short period of time once a charging 
> task with a pending SIGKILL or in the exit path has been able to exit."  
> Users of this interface typically also disable the memcg oom killer 
> through the same file, it's ludicrous to put the responsibility on 
> userspace to determine if the wakeup is actionable and requires it to 
> intervene in one of the methods listed in section 10.

Kind of a bummer that you haven't read anything I wrote...

But here is a patch that defers wakeups until we know for sure that
userspace action is required:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f1a0ae6..cc6adac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2254,8 +2254,17 @@ bool mem_cgroup_oom_synchronize(bool handle)
 
 	locked = mem_cgroup_oom_trylock(memcg);
 
+#if 0
+	/*
+	 * XXX: An unrelated task in the group might exit at any time,
+	 * making the OOM kill unnecessary.  We don't want to wake up
+	 * the userspace handler unless we are certain it needs to
+	 * intervene, so disable notifications until we solve the
+	 * halting problem.
+	 */
 	if (locked)
 		mem_cgroup_oom_notify(memcg);
+#endif
 
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-03 23:50                               ` David Rientjes
  2013-12-04  3:34                                 ` Johannes Weiner
@ 2013-12-04 11:13                                 ` Michal Hocko
  2013-12-05  0:23                                   ` David Rientjes
  1 sibling, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-04 11:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue 03-12-13 15:50:41, David Rientjes wrote:
> On Tue, 3 Dec 2013, Michal Hocko wrote:
> 
> > OK, as it seems that the notification part is too controversial, how
> > would you like the following? It reverts the notification part and still
> > solves the fault on exit path. I will prepare the full patch with the
> > changelog if this looks reasonable:
> 
> Um, no, that's not satisfactory because it obviously does the check after 
> mem_cgroup_oom_notify().  There is absolutely no reason why userspace 
> should be woken up when current simply needs access to memory reserves to 
> exit. 

Let me repeat that the only reason I liked the patch was that it solves
the fault-during-exit issue with oom disabled, which I am really worried
about.
A nice side effect was that it moves the TIF_MEMDIE logic into a common
place. It seems that you are selling the side effect as a primary
feature.
Johannes is obviously against such a change for reasons I won't
repeat here. It is true that such a change wouldn't give us the
"notify only when an action is taken" semantic because the oom path might
bail out a few more times before killing anything.  Until we have that,
or agree on which semantic actually makes the most sense, let's
back out of this, fix the actual bug, which is real, and drop the
tweak that only does it halfway.

> You can already get such notification by memory thresholds at the 
> memcg limit.
> 
> I'll repeat: Section 10 of Documentation/cgroups/memory.txt specifies what 
> userspace should do when waking up; one of those options is not "check if 
> the memcg is still actually oom in a short period of time once a charging 
> task with a pending SIGKILL or in the exit path has been able to exit."
> Users of this interface typically also disable the memcg oom killer 
> through the same file, it's ludicrous to put the responsibility on 
> userspace to determine if the wakeup is actionable and requires it to 
> intervene in one of the methods listed in section 10.

David, you would need to show us that such a condition happens in real
loads often enough that such a tweak is worth it. Repeating that a race
exists doesn't help, because yeah it does and it will after your patch
as well. So show us that it happens considerably less often with this
check.
 
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-04 11:13                                 ` Michal Hocko
@ 2013-12-05  0:23                                   ` David Rientjes
  2013-12-09 12:48                                     ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-12-05  0:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 4 Dec 2013, Michal Hocko wrote:

> > I'll repeat: Section 10 of Documentation/cgroups/memory.txt specifies what 
> > userspace should do when waking up; one of those options is not "check if 
> > the memcg is still actually oom in a short period of time once a charging 
> > task with a pending SIGKILL or in the exit path has been able to exit."
> > Users of this interface typically also disable the memcg oom killer 
> > through the same file, it's ludicrous to put the responsibility on 
> > userspace to determine if the wakeup is actionable and requires it to 
> > intervene in one of the methods listed in section 10.
> 
> David, you would need to show us that such a condition happens in real
> loads often enough that such a tweak is worth it. Repeating that a race
> exists doesn't help, because yeah it does and it will after your patch
> as well. So show us that it happens considerably less often with this
> check.
>  

Google depends on getting memory.oom_control notifications only when they 
are actionable, which is exactly how Documentation/cgroups/memory.txt 
describes how userspace should respond to such a notification.

"Actionable" here means that the kernel has exhausted its capabilities of 
allowing for future memory freeing, which is the entire premise of any oom 
killer.

Giving a dying process or a process that is going to subsequently die 
access to memory reserves is a capability the kernel uses to ensure
progress is made in oom conditions.  It is not an exhaustion of 
capabilities.

Yes, we all know that subsequent to the userspace notification that memory 
may be freed and the kill no longer becomes required.  There is nothing 
that can be done about that, and it has never been implied that a memcg is 
guaranteed to still be oom when the process wakes up.

I'm referring to a situation that can manifest in a number of ways:
coincidental process exit, coincidental process being killed, 
VMPRESSURE_CRITICAL notification that results in a process being killed, 
or memory threshold notification that results in a process being killed.  
Regardless, we're talking about a situation where something is already 
in the exit path or has been killed and is simply attempting to free its 
memory.

Such a process simply needs access to memory reserves to make progress and 
free its memory as part of the exit path.  The process waiting on 
memory.oom_control does _not_ need to do any of the actions mentioned in 
Documentation/cgroups/memory.txt: reduce usage, enlarge the limit, kill a 
process, or move a process with charge migration.

It would be ridiculous to require anybody implementing such a process to 
check if the oom condition still exists after a period of time before 
taking such an action.  It would be required to wait for any possible 
dying task or process with a pending SIGKILL to exit and there's no way to 
determine how long is long enough to wait or that it will get woken up 
again if it relies on a second signal for the same oom condition.  At the 
same time, the action taken by such a process would still be as racy as it 
would with the patch: we simply can't guarantee memory is not freed 
immediately after we issue the SIGKILL.

What we can control is that the kernel has exhausted its capabilities of 
allowing for future memory freeing at the time of notification.  That's 
the goal of the patch, at the same time making it consistent with the 
documentation.
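
As an illustration of what a userspace handler sees, the sketch below parses
the contents of a cgroup v1 memory.oom_control file into its oom_kill_disable
and under_oom fields, the two keys described in
Documentation/cgroups/memory.txt.  The struct and function names are
hypothetical and are not taken from the patch or from any library.  As the
message above notes, the value of under_oom may already be stale by the time
it is read.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parsed view of a cgroup v1 memory.oom_control file (hypothetical type). */
struct oom_control_state {
	bool oom_kill_disable;	/* memcg oom killer is disabled for this group */
	bool under_oom;		/* group was under oom when the file was read */
};

/*
 * Parse the text of memory.oom_control, which is a list of "key value"
 * lines.  Unknown keys are ignored.  Returns 0 on success, or -1 if
 * either of the two expected keys is missing.
 */
int parse_oom_control(const char *text, struct oom_control_state *st)
{
	bool have_disable = false, have_under = false;
	const char *line = text;

	while (line && *line) {
		char key[32];
		int val;

		if (sscanf(line, "%31s %d", key, &val) == 2) {
			if (strcmp(key, "oom_kill_disable") == 0) {
				st->oom_kill_disable = (val != 0);
				have_disable = true;
			} else if (strcmp(key, "under_oom") == 0) {
				st->under_oom = (val != 0);
				have_under = true;
			}
		}
		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return (have_disable && have_under) ? 0 : -1;
}
```

A handler would call this on the bytes read from memory.oom_control after a
wakeup; whether the result is still accurate a moment later is exactly the
race discussed in this thread.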

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-05  0:23                                   ` David Rientjes
@ 2013-12-09 12:48                                     ` Michal Hocko
  2013-12-09 21:46                                       ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-09 12:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed 04-12-13 16:23:41, David Rientjes wrote:
> On Wed, 4 Dec 2013, Michal Hocko wrote:
> 
> > > I'll repeat: Section 10 of Documentation/cgroups/memory.txt specifies what 
> > > userspace should do when waking up; one of those options is not "check if 
> > > the memcg is still actually oom in a short period of time once a charging 
> > > task with a pending SIGKILL or in the exit path has been able to exit."
> > > Users of this interface typically also disable the memcg oom killer 
> > > through the same file, it's ludicrous to put the responsibility on 
> > > userspace to determine if the wakeup is actionable and requires it to 
> > > intervene in one of the methods listed in section 10.
> > 
> > David, you would need to show us that such a condition happens in real
> > loads often enough that such a tweak is worth it. Repeating that a race
> > exists doesn't help, because yeah it does and it will after your patch
> > as well. So show us that it happens considerably less often with this
> > check.
> >  
> 
> Google depends on getting memory.oom_control notifications only when they 
> are actionable, which is exactly how Documentation/cgroups/memory.txt 
> describes how userspace should respond to such a notification.
> 
> "Actionable" here means that the kernel has exhausted its capabilities of 
> allowing for future memory freeing, which is the entire premise of any oom 
> killer.
> 
> Giving a dying process or a process that is going to subsequently die 
> access to memory reserves is a capability the kernel uses to ensure
> progress is made in oom conditions.  It is not an exhaustion of 
> capabilities.
> 
> Yes, we all know that subsequent to the userspace notification that memory 
> may be freed and the kill no longer becomes required.  There is nothing 
> that can be done about that, and it has never been implied that a memcg is 
> guaranteed to still be oom when the process wakes up.
> 
> I'm referring to a situation that can manifest in a number of ways:
> coincidental process exit, coincidental process being killed, 
> VMPRESSURE_CRITICAL notification that results in a process being killed, 
> or memory threshold notification that results in a process being killed.  
> Regardless, we're talking about a situation where something is already 
> in the exit path or has been killed and is simply attempting to free its 
> memory.

You have already mentioned that. Several times in fact. And I do
understand what you are saying. You are just not backing your claims
with anything that would convince us that what you are trying to solve
is an issue in the real life. So show us it is real, please.

> Such a process simply needs access to memory reserves to make progress and 
> free its memory as part of the exit path.  The process waiting on 
> memory.oom_control does _not_ need to do any of the actions mentioned in 
> Documentation/cgroups/memory.txt: reduce usage, enlarge the limit, kill a 
> process, or move a process with charge migration.
> 
> It would be ridiculous to require anybody implementing such a process to 
> check if the oom condition still exists after a period of time before 
> taking such an action.

Why would you consider that ridiculous? If your memcg is oom already
then waiting a few seconds to let racing tasks finish doesn't sound that
bad to me.

> It would be required to wait for any possible 
> dying task or process with a pending SIGKILL to exit and there's no way to 
> determine how long is long enough to wait or that it will get woken up 
> again if it relies on a second signal for the same oom condition.  At the 
> same time, the action taken by such a process would still be as racy as it 
> would with the patch: we simply can't guarantee memory is not freed 
> immediately after we issue the SIGKILL.
> 
> What we can control is that the kernel has exhausted its capabilities of 
> allowing for future memory freeing at the time of notification.  That's 
> the goal of the patch, at the same time making it consistent with the 
> documentation.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-09 12:48                                     ` Michal Hocko
@ 2013-12-09 21:46                                       ` David Rientjes
  2013-12-09 22:51                                         ` Johannes Weiner
                                                           ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: David Rientjes @ 2013-12-09 21:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon, 9 Dec 2013, Michal Hocko wrote:

> > Google depends on getting memory.oom_control notifications only when they 
> > are actionable, which is exactly how Documentation/cgroups/memory.txt 
> > describes how userspace should respond to such a notification.
> > 
> > "Actionable" here means that the kernel has exhausted its capabilities of 
> > allowing for future memory freeing, which is the entire premise of any oom 
> > killer.
> > 
> > Giving a dying process or a process that is going to subsequently die 
> > access to memory reserves is a capability the kernel uses to ensure
> > progress is made in oom conditions.  It is not an exhaustion of 
> > capabilities.
> > 
> > Yes, we all know that subsequent to the userspace notification that memory 
> > may be freed and the kill no longer becomes required.  There is nothing 
> > that can be done about that, and it has never been implied that a memcg is 
> > guaranteed to still be oom when the process wakes up.
> > 
> > I'm referring to a situation that can manifest in a number of ways:
> > coincidental process exit, coincidental process being killed, 
> > VMPRESSURE_CRITICAL notification that results in a process being killed, 
> > or memory threshold notification that results in a process being killed.  
> > Regardless, we're talking about a situation where something is already 
> > in the exit path or has been killed and is simply attempting to free its 
> > memory.
> 
> You have already mentioned that. Several times in fact. And I do
> understand what you are saying. You are just not backing your claims
> with anything that would convince us that what you are trying to solve
> is an issue in the real life. So show us it is real, please.
> 

What exactly would you like to see?  It's obvious that the kernel has not 
exhausted its capabilities of allowing for future memory freeing if the 
notification happens before the check for current->flags & PF_EXITING or 
fatal_signal_pending(current).  Does that conditional get triggered?  ALL 
THE TIME.  We know it happens because I had to introduce it into both the 
system oom killer and the memcg oom killer to fix mm->mmap_sem issues for 
threads that were killed as part of the oom killer SIGKILL but weren't the 
thread lucky enough to get TIF_MEMDIE set and they were in the allocation 
path.

Are you asking me to patch our kernel, get it rolled out, and plot a graph 
to show how often it gets triggered over time in our datacenters and that 
it causes us to get unnecessary oom kill notifications?

I'm trying to support you in any way I can by giving you the information 
you need, but in all honesty this seems pretty trivial and obvious to 
understand.  I'm really quite stunned at this thread.  What exactly are 
you arguing in the other direction for?  What does giving an oom 
notification before allowing exiting processes to free its memory so the 
memcg or system is no longer oom do?  Why can't you use memory thresholds 
or vmpressure for such a situation?

> > Such a process simply needs access to memory reserves to make progress and 
> > free its memory as part of the exit path.  The process waiting on 
> > memory.oom_control does _not_ need to do any of the actions mentioned in 
> > Documentation/cgroups/memory.txt: reduce usage, enlarge the limit, kill a 
> > process, or move a process with charge migration.
> > 
> > It would be ridiculous to require anybody implementing such a process to 
> > check if the oom condition still exists after a period of time before 
> > taking such an action.
> 
> Why would you consider that ridiculous? If your memcg is oom already
> then waiting few seconds to let racing tasks finish doesn't sound that
> bad to me.
> 

A few seconds?  Is that just handwaving or are you making a guarantee that 
all processes that need access to memory reserves will wake up, try its 
allocation, get the memcg's oom lock, get access to memory reserves, 
allocate, return to handle its pending SIGKILL, proceed down the exit() 
path, and free its memory by then?

Meanwhile, the userspace oom handler is doing its little sleep(3) that you 
suggest, it checks the status of the memcg, finds it's still oom, but 
doesn't realize because it didn't do a second blocking read() that it's a
second oom condition for a different process attached to the memcg and 
that process simply needs memory reserves to exit.
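
For reference, the registration and blocking read() mentioned above can be
sketched as follows.  It follows the steps listed in
Documentation/cgroups/memory.txt for cgroup v1: create an eventfd, open
memory.oom_control, and write "<event_fd> <fd of memory.oom_control>" to
cgroup.event_control.  The struct, the helper names and the choice to keep
the memory.oom_control descriptor open are assumptions made for this
illustration, not part of the patch.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical handle for one registered oom listener. */
struct oom_listener {
	int event_fd;		/* read() blocks here until a notification */
	int oom_control_fd;	/* kept open as a conservative assumption */
};

/* Build the "<event_fd> <oom_control_fd>" registration string. */
int format_oom_registration(char *buf, size_t len, int event_fd,
			    int oom_control_fd)
{
	int n = snprintf(buf, len, "%d %d", event_fd, oom_control_fd);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Open "<dir>/<name>", failing cleanly if the path would be truncated. */
static int open_in(const char *dir, const char *name, int flags)
{
	char path[4096];
	int n = snprintf(path, sizeof(path), "%s/%s", dir, name);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return open(path, flags);
}

/* Register for oom notifications on the memcg at cgroup_dir; 0 or -1. */
int register_oom_listener(const char *cgroup_dir, struct oom_listener *l)
{
	char reg[64];
	int cfd, n;

	l->event_fd = eventfd(0, 0);
	l->oom_control_fd = open_in(cgroup_dir, "memory.oom_control", O_RDONLY);
	cfd = open_in(cgroup_dir, "cgroup.event_control", O_WRONLY);
	if (l->event_fd < 0 || l->oom_control_fd < 0 || cfd < 0)
		goto fail;

	n = format_oom_registration(reg, sizeof(reg), l->event_fd,
				    l->oom_control_fd);
	if (n < 0 || write(cfd, reg, (size_t)n) != n)
		goto fail;

	close(cfd);
	return 0;
fail:
	if (cfd >= 0)
		close(cfd);
	if (l->oom_control_fd >= 0)
		close(l->oom_control_fd);
	if (l->event_fd >= 0)
		close(l->event_fd);
	return -1;
}

/*
 * Block until the eventfd is signalled.  Returns the eventfd counter
 * accumulated since the previous read, or -1 on error.
 */
int64_t wait_for_oom_events(int event_fd)
{
	uint64_t count;

	if (read(event_fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

A handler that sleeps and then only re-reads memory.oom_control never calls
wait_for_oom_events() a second time, so it has no record of whether the
condition it sees after the sleep is the same one that woke it.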

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-09 21:46                                       ` David Rientjes
@ 2013-12-09 22:51                                         ` Johannes Weiner
  2013-12-09 23:05                                         ` Johannes Weiner
  2013-12-10 10:38                                         ` Michal Hocko
  2 siblings, 0 replies; 87+ messages in thread
From: Johannes Weiner @ 2013-12-09 22:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon, Dec 09, 2013 at 01:46:16PM -0800, David Rientjes wrote:
> On Mon, 9 Dec 2013, Michal Hocko wrote:
> 
> > > Google depends on getting memory.oom_control notifications only when they 
> > > are actionable, which is exactly how Documentation/cgroups/memory.txt 
> > > describes how userspace should respond to such a notification.
> > > 
> > > "Actionable" here means that the kernel has exhausted its capabilities of 
> > > allowing for future memory freeing, which is the entire premise of any oom 
> > > killer.
> > > 
> > > Giving a dying process or a process that is going to subsequently die 
> > > access to memory reserves is a capability the kernel uses to ensure
> > > progress is made in oom conditions.  It is not an exhaustion of 
> > > capabilities.
> > > 
> > > Yes, we all know that subsequent to the userspace notification that memory 
> > > may be freed and the kill no longer becomes required.  There is nothing 
> > > that can be done about that, and it has never been implied that a memcg is 
> > > guaranteed to still be oom when the process wakes up.
> > > 
> > > I'm referring to a situation that can manifest in a number of ways:
> > > coincidental process exit, coincidental process being killed, 
> > > VMPRESSURE_CRITICAL notification that results in a process being killed, 
> > > or memory threshold notification that results in a process being killed.  
> > > Regardless, we're talking about a situation where something is already 
> > > in the exit path or has been killed and is simply attempting to free its 
> > > memory.
> > 
> > You have already mentioned that. Several times in fact. And I do
> > understand what you are saying. You are just not backing your claims
> > with anything that would convince us that what you are trying to solve
> > is an issue in the real life. So show us it is real, please.
> > 
> 
> What exactly would you like to see?  It's obvious that the kernel has not 
> exhausted its capabilities of allowing for future memory freeing if the 
> notification happens before the check for current->flags & PF_EXITING or 
> fatal_signal_pending(current).  Does that conditional get triggered?  ALL 
> THE TIME.

We check for fatal signals during the repeated charge attempts and
reclaim.  Should we be checking for PF_EXITING too?

> We know it happens because I had to introduce it into both the 
> system oom killer and the memcg oom killer to fix mm->mmap_sem issues for 
> threads that were killed as part of the oom killer SIGKILL but weren't the 
> thread lucky enough to get TIF_MEMDIE set and they were in the allocation 
> path.
>
> Are you asking me to patch our kernel, get it rolled out, and plot a graph 
> to show how often it gets triggered over time in our datacenters and that 
> it causes us to get unnecessary oom kill notifications?
> 
> I'm trying to support you in any way I can by giving you the information 
> you need, but in all honesty this seems pretty trivial and obvious to 
> understand.  I'm really quite stunned at this thread.  What exactly are 
> you arguing in the other direction for?  What does giving an oom 
> notification before allowing exiting processes to free its memory so the 
> memcg or system is no longer oom do?  Why can't you use memory thresholds 
> or vmpressure for such a situation?
> 
> > > Such a process simply needs access to memory reserves to make progress and 
> > > free its memory as part of the exit path.  The process waiting on 
> > > memory.oom_control does _not_ need to do any of the actions mentioned in 
> > > Documentation/cgroups/memory.txt: reduce usage, enlarge the limit, kill a 
> > > process, or move a process with charge migration.
> > > 
> > > It would be ridiculous to require anybody implementing such a process to 
> > > check if the oom condition still exists after a period of time before 
> > > taking such an action.
> > 
> > Why would you consider that ridiculous? If your memcg is oom already
> > then waiting few seconds to let racing tasks finish doesn't sound that
> > bad to me.
> > 
> 
> A few seconds?  Is that just handwaving or are you making a guarantee that 
> all processes that need access to memory reserves will wake up, try its 
> allocation, get the memcg's oom lock, get access to memory reserves, 
> allocate, return to handle its pending SIGKILL, proceed down the exit() 
> path, and free its memory by then?
> 
> Meanwhile, the userspace oom handler is doing its little sleep(3) that you 
> suggest, it checks the status of the memcg, finds it's still oom, but 
> doesn't realize because it didn't do a second blocking read() that it's a
> second oom condition for a different process attached to the memcg and 
> that process simply needs memory reserves to exit.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-09 21:46                                       ` David Rientjes
  2013-12-09 22:51                                         ` Johannes Weiner
@ 2013-12-09 23:05                                         ` Johannes Weiner
  2014-01-10  0:34                                           ` David Rientjes
  2013-12-10 10:38                                         ` Michal Hocko
  2 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2013-12-09 23:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon, Dec 09, 2013 at 01:46:16PM -0800, David Rientjes wrote:
> On Mon, 9 Dec 2013, Michal Hocko wrote:
> 
> > > Google depends on getting memory.oom_control notifications only when they 
> > > are actionable, which is exactly how Documentation/cgroups/memory.txt 
> > > describes how userspace should respond to such a notification.
> > > 
> > > "Actionable" here means that the kernel has exhausted its capabilities of 
> > > allowing for future memory freeing, which is the entire premise of any oom 
> > > killer.
> > > 
> > > Giving a dying process or a process that is going to subsequently die 
> > > access to memory reserves is a capability the kernel uses to ensure
> > > progress is made in oom conditions.  It is not an exhaustion of 
> > > capabilities.
> > > 
> > > Yes, we all know that subsequent to the userspace notification that memory 
> > > may be freed and the kill no longer becomes required.  There is nothing 
> > > that can be done about that, and it has never been implied that a memcg is 
> > > guaranteed to still be oom when the process wakes up.
> > > 
> > > I'm referring to a situation that can manifest in a number of ways:
> > > coincidental process exit, coincidental process being killed, 
> > > VMPRESSURE_CRITICAL notification that results in a process being killed, 
> > > or memory threshold notification that results in a process being killed.  
> > > Regardless, we're talking about a situation where something is already 
> > > in the exit path or has been killed and is simply attempting to free its 
> > > memory.
> > 
> > You have already mentioned that. Several times in fact. And I do
> > understand what you are saying. You are just not backing your claims
> > with anything that would convince us that what you are trying to solve
> > is an issue in the real life. So show us it is real, please.
> > 
> 
> What exactly would you like to see?  It's obvious that the kernel has not 
> exhausted its capabilities of allowing for future memory freeing if the 
> notification happens before the check for current->flags & PF_EXITING or 
> fatal_signal_pending(current).  Does that conditional get triggered?  ALL 
> THE TIME.

We check for fatal signals during the repeated charge attempts and
reclaim.  Should we be checking for PF_EXITING too?

> We know it happens because I had to introduce it into both the 
> system oom killer and the memcg oom killer to fix mm->mmap_sem issues for 
> threads that were killed as part of the oom killer SIGKILL but weren't the 
> thread lucky enough to get TIF_MEMDIE set and they were in the allocation 
> path.
>
> Are you asking me to patch our kernel, get it rolled out, and plot a graph 
> to show how often it gets triggered over time in our datacenters and that 
> it causes us to get unnecessary oom kill notifications?

I asked this because you were talking about all this nonsense of
last-second checks and the probability of unnecessary kills.

You kept insisting that this check has to be the last action before
the OOM kill, which was an entirely different motivation for this
change and I questioned the validity of this claim repeatedly during
this thread, to which you never answered.

You even reinforced this motivation by suggesting the separate memcg
margin check right before the OOM kill, so don't blame us for
misunderstanding the exact placement of this check as your main
argument when you repeated it over and over.

All I object to is that the OOM killer is riddled with last-second
checks of whether the OOM situation is still existent.  We establish
that the context is OOM and once we are certain we are executing,
period.

Not catching PF_EXITING in the long window between the first reclaim
and going OOM is a separate issue and I can see that this should be
fixed but it should be checked before we start invoking OOM.
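
The ordering under discussion can be modelled in a few lines.  This is a
userspace sketch, not kernel code: the struct, enum and function names are
hypothetical, and the two booleans merely stand in for the
current->flags & PF_EXITING and fatal_signal_pending(current) tests
mentioned earlier in the thread.  It only illustrates the idea that a dying
task is let through before any OOM handling, and therefore before any
notification, is started.

```c
#include <stdbool.h>

/* Simplified stand-in for the state of the charging task. */
struct task_model {
	bool exiting;			/* models current->flags & PF_EXITING */
	bool fatal_signal_pending;	/* models fatal_signal_pending(current) */
};

enum oom_action {
	OOM_LET_TASK_EXIT,	/* task is dying: let it proceed and free memory */
	OOM_INVOKE_HANDLING,	/* nothing pending: notify userspace / kill */
};

/*
 * Decide what to do when a charge cannot be satisfied after reclaim.
 * The dying-task check comes first, so OOM handling (including the
 * userspace notification) is reached only when no exit is in flight.
 */
enum oom_action oom_decide(const struct task_model *task)
{
	if (task->exiting || task->fatal_signal_pending)
		return OOM_LET_TASK_EXIT;
	return OOM_INVOKE_HANDLING;
}
```

Where exactly such a check belongs in the kernel, in the charge path or
immediately before the notification, is the point of disagreement in this
thread; the model takes no position beyond placing it ahead of the
notification.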

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-09 21:46                                       ` David Rientjes
  2013-12-09 22:51                                         ` Johannes Weiner
  2013-12-09 23:05                                         ` Johannes Weiner
@ 2013-12-10 10:38                                         ` Michal Hocko
  2013-12-11  1:03                                           ` David Rientjes
  2 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-10 10:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Mon 09-12-13 13:46:16, David Rientjes wrote:
> On Mon, 9 Dec 2013, Michal Hocko wrote:
> 
> > > Google depends on getting memory.oom_control notifications only when they 
> > > are actionable, which is exactly how Documentation/cgroups/memory.txt 
> > > describes how userspace should respond to such a notification.
> > > 
> > > "Actionable" here means that the kernel has exhausted its capabilities of 
> > > allowing for future memory freeing, which is the entire premise of any oom 
> > > killer.
> > > 
> > > Giving a dying process or a process that is going to subsequently die 
> > > access to memory reserves is a capability the kernel uses to ensure
> > > progress is made in oom conditions.  It is not an exhaustion of 
> > > capabilities.
> > > 
> > > Yes, we all know that subsequent to the userspace notification that memory 
> > > may be freed and the kill no longer becomes required.  There is nothing 
> > > that can be done about that, and it has never been implied that a memcg is 
> > > guaranteed to still be oom when the process wakes up.
> > > 
> > > I'm referring to a situation that can manifest in a number of ways:
> > > coincidental process exit, coincidental process being killed, 
> > > VMPRESSURE_CRITICAL notification that results in a process being killed, 
> > > or memory threshold notification that results in a process being killed.  
> > > Regardless, we're talking about a situation where something is already 
> > > in the exit path or has been killed and is simply attempting to free its 
> > > memory.
> > 
> > You have already mentioned that. Several times in fact. And I do
> > understand what you are saying. You are just not backing your claims
> > with anything that would convince us that what you are trying to solve
> > is an issue in the real life. So show us it is real, please.
> > 
> 
> What exactly would you like to see?

How often do you see PF_EXITING tasks which haven't been killed causing
a pointless notification? Because fatal_signal_pending and TIF_MEMDIE
cases are already handled because we bypass charges in those cases (except
for user OOM killer killed tasks which don't get TIF_MEMDIE and that
should be fixed).

> It's obvious that the kernel has not 
> exhausted its capabilities of allowing for future memory freeing if the 
> notification happens before the check for current->flags & PF_EXITING or 
> fatal_signal_pending(current).  Does that conditional get triggered?  ALL 
> THE TIME.  We know it happens because I had to introduce it into both the 
> system oom killer and the memcg oom killer to fix mm->mmap_sem issues for 
> threads that were killed as part of the oom killer SIGKILL but weren't the 
> thread lucky enough to get TIF_MEMDIE set and they were in the allocation 
> path.

An OOM-killed task without TIF_MEMDIE is surely a problem. But the only
place I see this might happen right now is when a task is killed by
user space. And as I've said repeatedly, this has to be fixed, and it is
tangential to the notification problem, which your patch tries to handle
but does not solve for the other cases where we have notified but back out
of killing later.

> Are you asking me to patch our kernel, get it rolled out, and plot a graph 
> to show how often it gets triggered over time in our datacenters and that 
> it causes us to get unnecessary oom kill notifications?
> 
> I'm trying to support you in any way I can by giving you the information 
> you need, but in all honesty this seems pretty trivial and obvious to 
> understand.  I'm really quite stunned at this thread.  What exactly are 
> you arguing in the other direction for?  What does giving an oom 
> notification before allowing exiting processes to free its memory so the 
> memcg or system is no longer oom do?  Why can't you use memory thresholds 
> or vmpressure for such a situation?

David, I've tried to support you because I also think that notification
should be the last resort. But Johannes has a point that putting
these checks all over the place doesn't help long term. So I would really
like to see a better justification than "it helps".

> > > Such a process simply needs access to memory reserves to make progress and 
> > > free its memory as part of the exit path.  The process waiting on 
> > > memory.oom_control does _not_ need to do any of the actions mentioned in 
> > > Documentation/cgroups/memory.txt: reduce usage, enlarge the limit, kill a 
> > > process, or move a process with charge migration.
> > > 
> > > It would be ridiculous to require anybody implementing such a process to 
> > > check if the oom condition still exists after a period of time before 
> > > taking such an action.
> > 
> > Why would you consider that ridiculous? If your memcg is oom already
> > then waiting few seconds to let racing tasks finish doesn't sound that
> > bad to me.
> > 
> 
> A few seconds?  Is that just handwaving or are you making a guarantee that 
> all processes that need access to memory reserves will wake up, try its 
> allocation, get the memcg's oom lock, get access to memory reserves, 
> allocate, return to handle its pending SIGKILL, proceed down the exit() 
> path, and free its memory by then?
> 
> Meanwhile, the userspace oom handler is doing its little sleep(3) that you 
> suggest, it checks the status of the memcg, finds it's still oom, but 
> doesn't realize because it didn't do a second blocking read() that its a 
> second oom condition for a different process attached to the memcg and 
> that process simply needs memory reserves to exit.
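
For reference, the kind of userspace listener being argued about could be
sketched as below. It follows the registration steps listed in
Documentation/cgroups/memory.txt (create an eventfd, open
memory.oom_control, write both descriptors to cgroup.event_control). The
function names and the cgroup directory passed in are assumptions made
for illustration, not part of any patch in this thread.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Register an eventfd for OOM notifications on one memcg directory of a
 * cgroup v1 hierarchy. The directory path is supplied by the caller;
 * where the hierarchy is mounted is an assumption of this sketch.
 * Returns the eventfd on success, -1 on failure.
 */
int oom_listener_register(const char *cgroup_dir)
{
	char path[4096], line[64];
	int efd = -1, ofd = -1, cfd = -1, len;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	ofd = open(path, O_RDONLY);
	if (ofd < 0)
		goto fail;

	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	efd = eventfd(0, 0);
	if (efd < 0)
		goto fail;

	/* "<event_fd> <fd of memory.oom_control>" registers the notifier. */
	len = snprintf(line, sizeof(line), "%d %d", efd, ofd);
	if (write(cfd, line, (size_t)len) != (ssize_t)len)
		goto fail;

	/* Registration done; only efd is needed for waiting. */
	close(cfd);
	close(ofd);
	return efd;

fail:
	if (efd >= 0)
		close(efd);
	if (cfd >= 0)
		close(cfd);
	if (ofd >= 0)
		close(ofd);
	return -1;
}

/*
 * Block until the next notification arrives.
 * Returns the eventfd counter value, or 0 if the read failed.
 */
uint64_t oom_listener_wait(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

A handler built on this would call oom_listener_wait() in a loop. Each
wakeup only says that a notification was sent, which is why it matters
whether one is sent for an oom condition that an exiting task is about
to resolve on its own.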

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-10 10:38                                         ` Michal Hocko
@ 2013-12-11  1:03                                           ` David Rientjes
  2013-12-11  9:55                                             ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-12-11  1:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue, 10 Dec 2013, Michal Hocko wrote:

> > What exactly would you like to see?
> 
> How often do you see PF_EXITING tasks which haven't been killed causing
> a pointless notification? Because fatal_signal_pending and TIF_MEMDIE
> cases are already handled because we bypass charges in those cases (except
> for user OOM killer killed tasks which don't get TIF_MEMDIE and that
> should be fixed).
> 

Triggering a pointless notification with PF_EXITING is rare, yet one 
pointless notification can be avoided with the patch.  Additionally, it 
also avoids a pointless notification for a racing SIGKILL.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-11  1:03                                           ` David Rientjes
@ 2013-12-11  9:55                                             ` Michal Hocko
  2013-12-11 22:40                                               ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-11  9:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue 10-12-13 17:03:45, David Rientjes wrote:
> On Tue, 10 Dec 2013, Michal Hocko wrote:
> 
> > > What exactly would you like to see?
> > 
> > How often do you see PF_EXITING tasks which haven't been killed causing
> > a pointless notification? Because fatal_signal_pending and TIF_MEMDIE
> > cases are already handled because we bypass charges in those cases (except
> > for user OOM killer killed tasks which don't get TIF_MEMDIE and that
> > should be fixed).
> > 
> 
> Triggering a pointless notification with PF_EXITING is rare, yet one 
> pointless notification can be avoided with the patch. 

Sigh. Yes, it will avoid one particular and rare race. There will still
be notifications without oom kills.

Anyway.
Does reclaim make any sense for PF_EXITING tasks? Shouldn't we simply
bypass charges of these tasks automatically? Those tasks will free some
memory anyway, so why trigger reclaim and potentially OOM in the first
place? Do we need to go via the TIF_MEMDIE loop at all?

> Additionally, it also avoids a pointless notification for a racing
> SIGKILL.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-11  9:55                                             ` Michal Hocko
@ 2013-12-11 22:40                                               ` David Rientjes
  2013-12-12 10:31                                                 ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-12-11 22:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 11 Dec 2013, Michal Hocko wrote:

> > Triggering a pointless notification with PF_EXITING is rare, yet one 
> > pointless notification can be avoided with the patch. 
> 
> Sigh. Yes it will avoid one particular and rare race. There will still
> be notifications without oom kills.
> 

Would you prefer doing the mem_cgroup_oom_notify() in two places instead:

 - immediately before doing oom_kill_process() when it's guaranteed that
   the kernel would have killed something, and

 - when memory.oom_control == 1 in mem_cgroup_oom_synchronize()?

> Anyway.
> Does the reclaim make any sense for PF_EXITING tasks? Shouldn't we
> simply bypass charges of these tasks automatically. Those tasks will
> free some memory anyway so why to trigger reclaim and potentially OOM
> in the first place? Do we need to go via TIF_MEMDIE loop in the first
> place?
> 

I don't see any reason to make an optimization there since they will get 
TIF_MEMDIE set if reclaim has failed on one of their charges or if it 
results in a system oom through the page allocator's oom killer.  It would 
be nice to ensure reclaim has had a chance to free memory in the presence 
of any other potential parallel memory freeing.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-11 22:40                                               ` David Rientjes
@ 2013-12-12 10:31                                                 ` Michal Hocko
  2013-12-12 10:50                                                   ` Michal Hocko
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Michal Hocko @ 2013-12-12 10:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed 11-12-13 14:40:24, David Rientjes wrote:
> On Wed, 11 Dec 2013, Michal Hocko wrote:
> 
> > > Triggering a pointless notification with PF_EXITING is rare, yet one 
> > > pointless notification can be avoided with the patch. 
> > 
> > Sigh. Yes it will avoid one particular and rare race. There will still
> > be notifications without oom kills.
> > 
> 
> Would you prefer doing the mem_cgroup_oom_notify() in two places instead:
> 
>  - immediately before doing oom_kill_process() when it's guaranteed that
>    the kernel would have killed something, and
> 
>  - when memory.oom_control == 1 in mem_cgroup_oom_synchronize()?

Yes that would make sense to me. At least the two oom_control paths
would be consistent wrt. notifications. I thought it would be too messy
but it looks quite straightforward:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c72b03bf9679..5cb1deea6aac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2256,15 +2256,16 @@ bool mem_cgroup_oom_synchronize(bool handle)
 
 	locked = mem_cgroup_oom_trylock(memcg);
 
-	if (locked)
-		mem_cgroup_oom_notify(memcg);
-
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+		/* calls mem_cgroup_oom_notify if there is a task to kill */
 		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
 					 current->memcg_oom.order);
 	} else {
+		if (locked && memcg->oom_kill_disable)
+			mem_cgroup_oom_notify(memcg);
+
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1e4a600a6163..2a7f15900922 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,6 +470,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		victim = p;
 	}
 
+	if (memcg)
+		mem_cgroup_oom_notify(memcg);
+
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",


The semantics would be as simple as "a notification is sent only when
an action is due". It will still be racy, as nothing prevents a task
which is not under OOM from exiting and releasing some memory, but there
is no sensible way to address that. On the other hand, such semantics
would be sensible for oom_control listeners because they will know that
an action has to be or will be taken (the line was drawn).
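
The difference between the two notification rules can be written down as
a small userspace model. This is only a sketch: struct oom_event and both
function names are hypothetical, and the three booleans stand in for
state that the kernel derives in mem_cgroup_oom_synchronize() and
oom_kill_process().

```c
#include <stdbool.h>

/* Hypothetical, simplified snapshot of one memcg OOM event. */
struct oom_event {
	bool locked;            /* this task won mem_cgroup_oom_trylock() */
	bool oom_kill_disable;  /* memory.oom_control == 1 */
	bool victim_killed;     /* the kernel reached oom_kill_process() */
};

/* Rule before the diff: notify whenever the OOM lock was taken. */
bool notify_old(const struct oom_event *e)
{
	return e->locked;
}

/* Rule after the diff: notify only when an action is due. */
bool notify_new(const struct oom_event *e)
{
	if (!e->locked)
		return false;
	if (e->oom_kill_disable)
		return true;          /* userspace has to act */
	return e->victim_killed;  /* the kernel has acted */
}
```

The two rules differ in exactly one case: the lock is held, the kernel
oom killer is enabled, and it bails out without killing anything because
it found an already exiting task. The old rule notifies there and the
new one stays silent.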

Can we agree on this, Johannes? Or do you see the line as drawn once
mem_cgroup_oom_synchronize has been reached, no matter whether the
action is to be done or not?

Regardless of the above, we would still have to cope with PF_EXITING
tasks without TIF_MEMDIE entering OOM, which is a separate issue. I think
the easier and cleaner solution would be to bail out early and not even
charge for PF_EXITING tasks. It would solve the issue mentioned before
and also reduce the exit latency. Besides that, I do not think we are
talking about many charges, are we?

> > Anyway.
> > Does the reclaim make any sense for PF_EXITING tasks? Shouldn't we
> > simply bypass charges of these tasks automatically. Those tasks will
> > free some memory anyway so why to trigger reclaim and potentially OOM
> > in the first place? Do we need to go via TIF_MEMDIE loop in the first
> > place?
> > 
> 
> I don't see any reason to make an optimization there since they will get 
> TIF_MEMDIE set if reclaim has failed on one of their charges or if it 
> results in a system oom through the page allocator's oom killer.

This all will happen after MEM_CGROUP_RECLAIM_RETRIES full reclaim
rounds. Is it really worth the additional overhead just to later say "OK
go ahead and skip charges"?
And for the !oom memcg it might reclaim some pages which could have
stayed on LRUs just to free some memory a little bit later and release
the memory pressure.
So I would rather go with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c72b03bf9679..fee25c5934d2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 * MEMDIE process.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE)
-		     || fatal_signal_pending(current)))
+		     || fatal_signal_pending(current))
+		     || current->flags & PF_EXITING)
 		goto bypass;
 
 	if (unlikely(task_in_memcg_oom(current)))

rather than the later checks down the oom_synchronize paths. The comment
already mentions dying process...
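
Stripped of the kernel helpers, the diff above changes one predicate. The
following userspace sketch states it before and after. struct task_state
and the function names are hypothetical; each field stands in for the
kernel expression named in its comment.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the three task properties being tested. */
struct task_state {
	bool tif_memdie;            /* test_thread_flag(TIF_MEMDIE) */
	bool fatal_signal_pending;  /* fatal_signal_pending(current) */
	bool pf_exiting;            /* current->flags & PF_EXITING */
};

/* The bypass check as it stands before the diff. */
bool bypass_before(const struct task_state *t)
{
	return t->tif_memdie || t->fatal_signal_pending;
}

/* The bypass check with the proposed PF_EXITING term added. */
bool bypass_after(const struct task_state *t)
{
	return t->tif_memdie || t->fatal_signal_pending || t->pf_exiting;
}
```

The only task whose treatment changes is one that is exiting without
having been killed: it is charged (and may reclaim) under the first
predicate and skips the charge under the second.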

> It would be nice to ensure reclaim has had a chance to free memory in
> the presence of any other potential parallel memory freeing.

I am afraid I didn't get what you mean by this. We can only check we are
under OOM or try to reclaim to see if there is something...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-12 10:31                                                 ` Michal Hocko
@ 2013-12-12 10:50                                                   ` Michal Hocko
  2013-12-12 12:11                                                   ` Michal Hocko
  2013-12-13 23:55                                                   ` David Rientjes
  2 siblings, 0 replies; 87+ messages in thread
From: Michal Hocko @ 2013-12-12 10:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu 12-12-13 11:31:59, Michal Hocko wrote:
[...]
> > > Anyway.
> > > Does the reclaim make any sense for PF_EXITING tasks? Shouldn't we
> > > simply bypass charges of these tasks automatically. Those tasks will
> > > free some memory anyway so why to trigger reclaim and potentially OOM
> > > in the first place? Do we need to go via TIF_MEMDIE loop in the first
> > > place?
> > > 
> > 
> > I don't see any reason to make an optimization there since they will get 
> > TIF_MEMDIE set if reclaim has failed on one of their charges or if it 
> > results in a system oom through the page allocator's oom killer.
> 
> This all will happen after MEM_CGROUP_RECLAIM_RETRIES full reclaim
> rounds. Is it really worth the additional overhead just to later say "OK
> go ahead and skip charges"?
> And for the !oom memcg it might reclaim some pages which could have
> stayed on LRUs just to free some memory a little bit later and release
> the memory pressure.
> So I would rather go with
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c72b03bf9679..fee25c5934d2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	 * MEMDIE process.
>  	 */
>  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> -		     || fatal_signal_pending(current)))
> +		     || fatal_signal_pending(current))
> +		     || current->flags & PF_EXITING)
>  		goto bypass;
>  
>  	if (unlikely(task_in_memcg_oom(current)))
> 
> rather than the later checks down the oom_synchronize paths. The comment
> already mentions dying process...

With the full changelog. I will repost it in a separate thread if you
are OK with this.
---

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-12 10:31                                                 ` Michal Hocko
  2013-12-12 10:50                                                   ` Michal Hocko
@ 2013-12-12 12:11                                                   ` Michal Hocko
  2013-12-12 12:37                                                     ` Michal Hocko
  2013-12-13 23:55                                                   ` David Rientjes
  2 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-12 12:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu 12-12-13 11:31:59, Michal Hocko wrote:
[...]
> The semantic would be as simple as "notification is sent only when
> an action is due". It will be still racy as nothing prevents a task
> which is not under OOM to exit and release some memory but there is no
> sensible way to address that. On the other hand such a semantic would be
> sensible for oom_control listeners because they will know that an action
> has to be or will be taken (the line was drawn).
> 
> Can we agree on this, Johannes? Or you see the line drawn when
> mem_cgroup_oom_synchronize has been reached already no matter whether
> the action is to be done or not?

Something like the following:

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-12 12:11                                                   ` Michal Hocko
@ 2013-12-12 12:37                                                     ` Michal Hocko
  0 siblings, 0 replies; 87+ messages in thread
From: Michal Hocko @ 2013-12-12 12:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu 12-12-13 13:11:40, Michal Hocko wrote:
> On Thu 12-12-13 11:31:59, Michal Hocko wrote:
> [...]
> > The semantic would be as simple as "notification is sent only when
> > an action is due". It will be still racy as nothing prevents a task
> > which is not under OOM to exit and release some memory but there is no
> > sensible way to address that. On the other hand such a semantic would be
> > sensible for oom_control listeners because they will know that an action
> > has to be or will be taken (the line was drawn).
> > 
> > Can we agree on this, Johannes? Or you see the line drawn when
> > mem_cgroup_oom_synchronize has been reached already no matter whether
> > the action is to be done or not?
> 
> Something like the following:

I forgot to mention that this patch assumes "memcg: Do not hang on OOM
when killed by userspace OOM"

> From 5d9c01e2814a7ade49db7945ad3890f4f138855e Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Thu, 12 Dec 2013 11:50:17 +0100
> Subject: [PATCH] memcg: notify userspace about OOM when an action is due
> 
> Userspace is currently notified about an OOM condition after reclaim
> fails to free any memory in MEM_CGROUP_RECLAIM_RETRIES rounds.
> This usually means that the memcg is really in trouble and an
> OOM action (either done by userspace or kernel) has to be taken.
> The kernel OOM killer however bails out and doesn't kill anything
> if it sees an already dying/exiting task, in the hope that memory
> will be released and the OOM situation will be resolved.
> 
> Therefore it makes sense to notify userspace only after all
> measures have been taken and a userspace action is required or
> the kernel kills a task.
> 
> This patch also removes fatal_signal_pending and PF_EXITING check from
> mem_cgroup_oom_synchronize because __mem_cgroup_try_charge already
> checks for both and bypasses charge so we cannot end up in the oom path.

Hmm, I have just noticed that oom_scan_process_thread aborts scanning
only if it sees PF_EXITING or TIF_MEMDIE. Why is the same not done for
fatal_signal_pending tasks as well? Following the same logic as for
current, we should do that, no?

The different sets of checks are so confusing :/

> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c | 17 ++++-------------
>  mm/oom_kill.c   |  5 +++++
>  2 files changed, 9 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 98900c070045..af7148c77bac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2235,16 +2235,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!handle)
>  		goto cleanup;
>  
> -	/*
> -	 * If current has a pending SIGKILL or is exiting, then automatically
> -	 * select it.  The goal is to allow it to allocate so that it may
> -	 * quickly exit and free its memory.
> -	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> -		set_thread_flag(TIF_MEMDIE);
> -		goto cleanup;
> -	}
> -
>  	owait.memcg = memcg;
>  	owait.wait.flags = 0;
>  	owait.wait.func = memcg_oom_wake_function;
> @@ -2256,15 +2246,16 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  
>  	locked = mem_cgroup_oom_trylock(memcg);
>  
> -	if (locked)
> -		mem_cgroup_oom_notify(memcg);
> -
>  	if (locked && !memcg->oom_kill_disable) {
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> +		/* calls mem_cgroup_oom_notify if there is a task to kill */
>  		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
>  					 current->memcg_oom.order);
>  	} else {
> +		if (locked && memcg->oom_kill_disable)
> +			mem_cgroup_oom_notify(memcg);
> +
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1e4a600a6163..47c9de8da36d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -394,6 +394,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  		dump_tasks(memcg, nodemask);
>  }
>  
> +extern void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
> +
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  /*
>   * Must be called while holding a reference to p, which will be released upon
> @@ -470,6 +472,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		victim = p;
>  	}
>  
> +	if (memcg)
> +		mem_cgroup_oom_notify(memcg);
> +
>  	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> -- 
> 1.8.4.4
> 
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-12 10:31                                                 ` Michal Hocko
  2013-12-12 10:50                                                   ` Michal Hocko
  2013-12-12 12:11                                                   ` Michal Hocko
@ 2013-12-13 23:55                                                   ` David Rientjes
  2013-12-17 16:23                                                     ` Michal Hocko
  2 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-12-13 23:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Thu, 12 Dec 2013, Michal Hocko wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c72b03bf9679..5cb1deea6aac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2256,15 +2256,16 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  
>  	locked = mem_cgroup_oom_trylock(memcg);
>  
> -	if (locked)
> -		mem_cgroup_oom_notify(memcg);
> -
>  	if (locked && !memcg->oom_kill_disable) {
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> +		/* calls mem_cgroup_oom_notify if there is a task to kill */
>  		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
>  					 current->memcg_oom.order);
>  	} else {
> +		if (locked && memcg->oom_kill_disable)
> +			mem_cgroup_oom_notify(memcg);
> +
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1e4a600a6163..2a7f15900922 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -470,6 +470,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		victim = p;
>  	}
>  
> +	if (memcg)
> +		mem_cgroup_oom_notify(memcg);
> +
>  	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",

Yes, I like this.

> The semantic would be as simple as "notification is sent only when
> an action is due". It will be still racy as nothing prevents a task
> which is not under OOM to exit and release some memory but there is no
> sensible way to address that. On the other hand such a semantic would be
> sensible for oom_control listeners because they will know that an action
> has to be or will be taken (the line was drawn).
> 

I think this makes absolute sense and is in agreement with what is 
described in Documentation/cgroups/memory.txt.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c72b03bf9679..fee25c5934d2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	 * MEMDIE process.
>  	 */
>  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> -		     || fatal_signal_pending(current)))
> +		     || fatal_signal_pending(current))
> +		     || current->flags & PF_EXITING)
>  		goto bypass;
>  
>  	if (unlikely(task_in_memcg_oom(current)))
> 
> rather than the later checks down the oom_synchronize paths. The comment
> already mentions dying process...
> 

This is scary because it doesn't even try to reclaim memcg memory before 
allowing the allocation to succeed.  I think we could even argue that we 
should move the fatal_signal_pending(current) check to later and the only 
condition we should really be bypassing here is TIF_MEMDIE since it will 
only get set when reclaim has already failed.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-13 23:55                                                   ` David Rientjes
@ 2013-12-17 16:23                                                     ` Michal Hocko
  2013-12-17 20:50                                                       ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-17 16:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Fri 13-12-13 15:55:44, David Rientjes wrote:
> On Thu, 12 Dec 2013, Michal Hocko wrote:
[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c72b03bf9679..fee25c5934d2 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  	 * MEMDIE process.
> >  	 */
> >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > -		     || fatal_signal_pending(current)))
> > +		     || fatal_signal_pending(current))
> > +		     || current->flags & PF_EXITING)
> >  		goto bypass;
> >  
> >  	if (unlikely(task_in_memcg_oom(current)))
> > 
> > rather than the later checks down the oom_synchronize paths. The comment
> > already mentions dying process...
> > 
> 
> This is scary because it doesn't even try to reclaim memcg memory before 
> allowing the allocation to succeed.

Why should it reclaim in the first place when it is simply on the way to
release memory? In other words, why should it increase the memory
pressure when it is in fact releasing it?

I am really puzzled here. On one hand you are strongly arguing for not
notifying when we know we can prevent an OOM action, and on the other
hand you are OK with getting vmpressure/thresholds notifications when an
exiting task triggers reclaim.

So I am really lost as to what you are trying to achieve here. It sounds
a bit arbitrary.

> I think we could even argue that we should move the
> fatal_signal_pending(current) check to later and the only condition we
> should really be bypassing here is TIF_MEMDIE since it will only get
> set when reclaim has already failed.

Any arguments?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-17 16:23                                                     ` Michal Hocko
@ 2013-12-17 20:50                                                       ` David Rientjes
  2013-12-18 20:04                                                         ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-12-17 20:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue, 17 Dec 2013, Michal Hocko wrote:

> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index c72b03bf9679..fee25c5934d2 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > >  	 * MEMDIE process.
> > >  	 */
> > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > -		     || fatal_signal_pending(current)))
> > > +		     || fatal_signal_pending(current))
> > > +		     || current->flags & PF_EXITING)
> > >  		goto bypass;
> > >  
> > >  	if (unlikely(task_in_memcg_oom(current)))
> > > 
> > > rather than the later checks down the oom_synchronize paths. The comment
> > > already mentions dying process...
> > > 
> > 
> > This is scary because it doesn't even try to reclaim memcg memory before 
> > allowing the allocation to succeed.
> 
> Why should it reclaim in the first place when it simply is on the way to
> release memory. In other words why should it increase the memory
> pressure when it is in fact releasing it?
> 

(Answering about removing the fatal_signal_pending() check as well here.)

For memory isolation, we'd only want to bypass memcg charges when 
absolutely necessary and it seems like TIF_MEMDIE is the only case where 
that's required.  We don't give processes with pending SIGKILLs or those 
in the exit() path access to memory reserves in the page allocator without 
first determining that reclaim can't make any progress for the same reason 
and then we only do so by setting TIF_MEMDIE when calling the oom killer.  

> I am really puzzled here. On one hand you are strongly arguing for not
> notifying when we know we can prevent from OOM action and on the other
> hand you are ok to get vmpressure/thresholds notification when an
> exiting task triggers reclaim.
> 
> So I am really lost in what you are trying to achieve here. It sounds a
> bit arbitrary.
> 

It's not arbitrary to define when memcg bypass is allowed and, in my 
opinion, it should only be done in situations where it is unavoidable and 
therefore breaking memory isolation is required.

(We wouldn't expect a 128MB memcg to be oom [and perhaps with a userspace 
oom handler attached] when it has 100 children each 1MB in size just 
because they all happen to be oom at the same time.  We set up the excess 
memory in the parent specifically for the memcg with the oom handler 
attached.)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-17 20:50                                                       ` David Rientjes
@ 2013-12-18 20:04                                                         ` Michal Hocko
  2013-12-19  6:09                                                           ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-18 20:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Tue 17-12-13 12:50:09, David Rientjes wrote:
> On Tue, 17 Dec 2013, Michal Hocko wrote:
> 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index c72b03bf9679..fee25c5934d2 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > >  	 * MEMDIE process.
> > > >  	 */
> > > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > -		     || fatal_signal_pending(current)))
> > > > +		     || fatal_signal_pending(current))
> > > > +		     || current->flags & PF_EXITING)
> > > >  		goto bypass;
> > > >  
> > > >  	if (unlikely(task_in_memcg_oom(current)))
> > > > 
> > > > rather than the later checks down the oom_synchronize paths. The comment
> > > > already mentions dying process...
> > > > 
> > > 
> > > This is scary because it doesn't even try to reclaim memcg memory before 
> > > allowing the allocation to succeed.
> > 
> > Why should it reclaim in the first place when it simply is on the way to
> > release memory. In other words why should it increase the memory
> > pressure when it is in fact releasing it?
> > 
> 
> (Answering about removing the fatal_signal_pending() check as well here.)
> 
> For memory isolation, we'd only want to bypass memcg charges when 
> absolutely necessary and it seems like TIF_MEMDIE is the only case where 
> that's required.  We don't give processes with pending SIGKILLs or those 
> in the exit() path access to memory reserves in the page allocator without 
> first determining that reclaim can't make any progress for the same reason 
> and then we only do so by setting TIF_MEMDIE when calling the oom killer.  

While I do understand arguments about isolation I would also like to be
practical here. How many charges are we talking about? Dozen pages? Much
more?
Besides that all of those should be very short lived because the task
is going to die very soon and so the memory will be freed.

So from my POV I would like to see these heuristics as simple as
possible and placed at very few places. Doing a bypass before charge
- or even after a failed charge before doing reclaim sounds like an easy
enough heuristic without a big risk.
I have a really hard time seeing big benefits in forcing reclaim for a
very short lived charge because this might lead to different and much
worse side effects than quantum noise.

Maybe I am missing something and we can charge a lot during exit but
then I think we should fix the exit path to not allocate that much.

> > I am really puzzled here. On one hand you are strongly arguing for not
> > notifying when we know we can prevent from OOM action and on the other
> > hand you are ok to get vmpressure/thresholds notification when an
> > exiting task triggers reclaim.
> > 
> > So I am really lost in what you are trying to achieve here. It sounds a
> > bit arbitrary.
> > 
> 
> It's not arbitrary to define when memcg bypass is allowed and, in my 
> opinion, it should only be done in situations where it is unavoidable and 
> therefore breaking memory isolation is required.
> 
> (We wouldn't expect a 128MB memcg to be oom [and perhaps with a userspace 
> oom handler attached] when it has 100 children each 1MB in size just 
> because they all happen to be oom at the same time.  We set up the excess 

s/oom/exiting/ ?

> memory in the parent specifically for the memcg with the oom handler 
> attached.)

I am not sure I understand what you meant here.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-18 20:04                                                         ` Michal Hocko
@ 2013-12-19  6:09                                                           ` David Rientjes
  2013-12-19 14:41                                                             ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2013-12-19  6:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed, 18 Dec 2013, Michal Hocko wrote:

> > For memory isolation, we'd only want to bypass memcg charges when 
> > absolutely necessary and it seems like TIF_MEMDIE is the only case where 
> > that's required.  We don't give processes with pending SIGKILLs or those 
> > in the exit() path access to memory reserves in the page allocator without 
> > first determining that reclaim can't make any progress for the same reason 
> > and then we only do so by setting TIF_MEMDIE when calling the oom killer.  
> 
> While I do understand arguments about isolation I would also like to be
> practical here. How many charges are we talking about? Dozen pages? Much
> more?

The PF_EXITING bypass is indeed much less concerning than the 
fatal_signal_pending() bypass.

> Besides that all of those should be very short lived because the task
> is going to die very soon and so the memory will be freed.
> 

We don't know how much memory is being allocated while 
fatal_signal_pending() is true before the process can handle the SIGKILL, 
so this could potentially bypass a significant amount of memory.  If we 
are to have a configuration such as what Tejun recommended for oom 
handling:

			 _____root______
			/		\
		    user		 oom
		   /    \		/   \
		  A	 B	       a     b

where the limit of A + B can be greater than the limit of user for 
overcommit, and the limit of user is the amount of RAM minus whatever is 
reserved for the oom hierarchy, then significant bypass to the root memcg 
will cause memcgs in the oom hierarchy to actually not be able to allocate 
memory from the page allocator.

The PF_EXITING bypass is much less concerning because we shouldn't be 
doing significant memory allocation in the exit() path, but it's also true 
that neither the PF_EXITING nor the fatal_signal_pending() bypass is 
required.  In Tejun's suggested configuration above, we absolutely do want 
to reclaim from the user hierarchy before declaring oom and setting 
TIF_MEMDIE, otherwise the oom hierarchy cannot allocate.

> So from my POV I would like to see these heuristics as simple as
> possible and placed at very few places. Doing a bypass before charge
> - or even after a failed charge before doing reclaim sounds like an easy
> enough heuristic without a big risk.

It's a very significant risk of depleting memory that is available for oom 
handling in the suggested configuration.
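
The arithmetic behind this concern can be sketched as follows. This is only
an illustrative userspace model, not kernel code: struct oom_layout and the
functions user_limit(), children_overcommit() and oom_headroom() are
hypothetical names, and the model assumes the user subtree is charged up to
its limit while bypassed charges land in the root memcg.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the root/user/oom layout; all sizes in bytes. */
struct oom_layout {
	uint64_t ram;         /* total system RAM */
	uint64_t oom_reserve; /* memory set aside for the "oom" subtree */
	uint64_t limit_a;     /* limit of child A under "user" */
	uint64_t limit_b;     /* limit of child B under "user" */
};

/* Limit of "user": RAM minus whatever is reserved for the oom hierarchy. */
uint64_t user_limit(const struct oom_layout *l)
{
	return l->ram > l->oom_reserve ? l->ram - l->oom_reserve : 0;
}

/* True when the limits of A + B exceed the limit of "user" (overcommit). */
bool children_overcommit(const struct oom_layout *l)
{
	return l->limit_a + l->limit_b > user_limit(l);
}

/*
 * Memory left for the "oom" subtree when "user" is charged up to its limit
 * and `bypassed` further bytes were charged to root instead of "user".
 */
uint64_t oom_headroom(const struct oom_layout *l, uint64_t bypassed)
{
	uint64_t used = user_limit(l) + bypassed;

	return used >= l->ram ? 0 : l->ram - used;
}
```

In this model, every byte that bypasses the user limit comes directly out of
the headroom that was reserved for the oom subtree.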

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-19  6:09                                                           ` David Rientjes
@ 2013-12-19 14:41                                                             ` Michal Hocko
  2014-01-08  0:25                                                               ` Andrew Morton
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2013-12-19 14:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

On Wed 18-12-13 22:09:12, David Rientjes wrote:
> On Wed, 18 Dec 2013, Michal Hocko wrote:
> 
> > > For memory isolation, we'd only want to bypass memcg charges when 
> > > absolutely necessary and it seems like TIF_MEMDIE is the only case where 
> > > that's required.  We don't give processes with pending SIGKILLs or those 
> > > in the exit() path access to memory reserves in the page allocator without 
> > > first determining that reclaim can't make any progress for the same reason 
> > > and then we only do so by setting TIF_MEMDIE when calling the oom killer.  
> > 
> > While I do understand arguments about isolation I would also like to be
> > practical here. How many charges are we talking about? Dozen pages? Much
> > more?
> 
> The PF_EXITING bypass is indeed much less concerning than the 
> fatal_signal_pending() bypass.

OK, so can we at least agree on the patch posted here:
https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely
worth fixing.

> > Besides that all of those should be very short lived because the task
> > is going to die very soon and so the memory will be freed.
> > 
> 
> We don't know how much memory is being allocated while 
> fatal_signal_pending() is true before the process can handle the SIGKILL, 
> so this could potentially bypass a significant amount of memory. 

The question is: does it in _practice_?

We have had this behavior since 867578cbccb08, which is 2.6.34, and we
haven't seen a single report where a shot-down task would go over the
limit too much. This would suggest that such a case doesn't happen very
often.  If it happens or it is easily triggerable then I am all for
reverting that check but that would require a proper justification
rather than speculation.

> If we are to have a configuration such as what Tejun recommended for
> oom handling:
> 
> 			 _____root______
> 			/		\
> 		    user		 oom
> 		   /    \		/   \
> 		  A	 B	       a     b
> 
> where the limit of A + B can be greater than the limit of user for 
> overcommit, and the limit of user is the amount of RAM minus whatever is 
> reserved for the oom hierarchy, then significant bypass to the root memcg 
> will cause memcgs in the oom hierarchy to actually not be able to allocate 
> memory from the page allocator.

I can imagine that the killed task might be in the middle of an
allocation loop and rather far away from returning to userspace (e.g.
readahead comes to mind - although that one shouldn't cause the global
OOM).
I would argue that we shouldn't reclaim in such a case and rather fail
the charge. Reclaiming will not help us much. In an extreme case we
would end up in OOM and the killed task would get TIF_MEMDIE and so it
would be allowed to bypass charges and break the isolation anyway.
Can we fail charges for killed tasks in general? I am very skeptical
because this might be a regular allocation to make progress on the way
out.

So this doesn't solve the isolation problem, it just postpones it to
later and makes the life of other tasks in the same memcg worse because
their memory gets reclaimed which can lead to different performance
issues. And all of that for temporary charges which will go away shortly.

> The PF_EXITING bypass is much less concerning because we shouldn't be 
> doing significant memory allocation in the exit() path, but it's also true 
> that neither the PF_EXITING nor the fatal_signal_pending() bypass is 
> required. 

Yes, it is not, strictly speaking, required. It is very practical to do,
though. We do not know much about the context which called us so we
cannot base our decisions properly and just doing reclaim to see what
happens sounds like a bad decision to me.

> In Tejun's suggested configuration above, we absolutely do want 
> to reclaim from the user hierarchy before declaring oom and setting 
> TIF_MEMDIE, otherwise the oom hierarchy cannot allocate.
> 
> > So from my POV I would like to see these heuristics as simple as
> > possible and placed at very few places. Doing a bypass before charge
> > - or even after a failed charge before doing reclaim sounds like an easy
> > enough heuristic without a big risk.
> 
> It's a very significant risk of depleting memory that is available for oom 
> handling in the suggested configuration.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-19 14:41                                                             ` Michal Hocko
@ 2014-01-08  0:25                                                               ` Andrew Morton
  2014-01-08 10:33                                                                 ` Michal Hocko
  2014-01-09 21:34                                                                 ` [patch 1/2] mm, memcg: avoid oom notification when current needs " David Rientjes
  0 siblings, 2 replies; 87+ messages in thread
From: Andrew Morton @ 2014-01-08  0:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 19 Dec 2013 15:41:34 +0100 Michal Hocko <mhocko@suse.cz> wrote:

> On Wed 18-12-13 22:09:12, David Rientjes wrote:
> > On Wed, 18 Dec 2013, Michal Hocko wrote:
> > 
> > > > For memory isolation, we'd only want to bypass memcg charges when 
> > > > absolutely necessary and it seems like TIF_MEMDIE is the only case where 
> > > > that's required.  We don't give processes with pending SIGKILLs or those 
> > > > in the exit() path access to memory reserves in the page allocator without 
> > > > first determining that reclaim can't make any progress for the same reason 
> > > > and then we only do so by setting TIF_MEMDIE when calling the oom killer.  
> > > 
> > > While I do understand arguments about isolation I would also like to be
> > > practical here. How many charges are we talking about? Dozen pages? Much
> > > more?
> > 
> > The PF_EXITING bypass is indeed much less concerning than the 
> > fatal_signal_pending() bypass.

I just spent a happy half hour reliving this thread and ended up
deciding I agreed with everyone!  It appears that many more emails are
needed so I think I'll drop
http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
for now.

The claim that
mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
will impact existing userspace seems a bit dubious to me.

> OK, so can we at least agree on the patch posted here:
> https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely
> worth fixing.

Yes, can we please get Eric's bug fixed?  I don't believe that Eric has
tested either https://lkml.org/lkml/2013/12/12/129 or
http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch.
Is he the only person who can reproduce this?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-08  0:25                                                               ` Andrew Morton
@ 2014-01-08 10:33                                                                 ` Michal Hocko
  2014-01-09 14:30                                                                   ` [PATCH] memcg: Do not hang on OOM when killed by userspace OOM " Michal Hocko
  2014-01-09 21:34                                                                 ` [patch 1/2] mm, memcg: avoid oom notification when current needs " David Rientjes
  1 sibling, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-08 10:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Tue 07-01-14 16:25:03, Andrew Morton wrote:
[...]
> > OK, so can we at least agree on the patch posted here:
> > https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely
> > worth fixing.
> 
> Yes, can we please get Eric's bug fixed?  I don't believe that Eric has
> tested either https://lkml.org/lkml/2013/12/12/129 or
> http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch.
> Is he the only person who can reproduce this?

I have gathered 3 patches from all the discussion and plan to post them
today or tomorrow as the time permits. https://lkml.org/lkml/2013/12/12/129
will be a part of it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-08 10:33                                                                 ` Michal Hocko
@ 2014-01-09 14:30                                                                   ` Michal Hocko
  2014-01-09 21:40                                                                     ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-09 14:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Wed 08-01-14 11:33:19, Michal Hocko wrote:
> On Tue 07-01-14 16:25:03, Andrew Morton wrote:
> [...]
> > > OK, so can we at least agree on the patch posted here:
> > > https://lkml.org/lkml/2013/12/12/129. This is a real bug and definitely
> > > worth fixing.
> > 
> > Yes, can we please get Eric's bug fixed?  I don't believe that Eric has
> > tested either https://lkml.org/lkml/2013/12/12/129 or
> > http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch.
> > Is he the only person who can reproduce this?
> 
> I have gathered 3 patches from all the discussion and plan to post them
> today or tomorrow as the time permits. https://lkml.org/lkml/2013/12/12/129
> will be a part of it.

OK, I've decided to post the oom notification parts later because they
will likely generate some discussion which might distract from the
actual fix so here it goes (can be applied on both mmotm and the current
Linus' tree):
---

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-08  0:25                                                               ` Andrew Morton
  2014-01-08 10:33                                                                 ` Michal Hocko
@ 2014-01-09 21:34                                                                 ` David Rientjes
  2014-01-09 22:47                                                                   ` Andrew Morton
  1 sibling, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-09 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Tue, 7 Jan 2014, Andrew Morton wrote:

> I just spent a happy half hour reliving this thread and ended up
> deciding I agreed with everyone!  It appears that many more emails are
> needed so I think I'll drop
> http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> for now.
> 
> The claim that
> mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> will impact existing userspace seems a bit dubious to me.
> 

I'm not sure why this was dropped since it's vitally needed for any sane 
userspace oom handler to be effective.

Without the patch, a userspace oom handler waiting on memory.oom_control 
will be triggered when any process with a pending SIGKILL or in the exit() 
path simply needs access to memory reserves to make forward progress.  The 
kernel oom killer itself is preempted since nothing is actionable other 
than giving current access to memory reserves by setting the TIF_MEMDIE 
bit.  Userspace does not have the privilege to set this bit itself, so in 
such cases there is absolutely nothing actionable for the userspace oom 
handler.

The problem is that the userspace oom handler doesn't know that.

It would be ludicrous to require that a userspace oom handler must wait 
for some arbitrary amount of time to determine if it is actionable or not; 
what is a sane amount of time to wait?  Should we reliably expect that 
multiple oom notifications will be sent over a period of time if we are in 
a situation where current doesn't require memory reserves to make forward 
progress?  How long should the userspace oom handler store this state to 
determine how many times it has woken up?

Userspace oom handling implementations are fragile enough as it is, they 
should be made as trivial as possible to ensure they can do what is needed 
to make memory available, have the smallest memory footprint possible, and 
be as reliable as possible.  Requiring them to determine when a 
notification is actionable is troublesome.

Furthermore, Section 10 of Documentation/cgroups/memory.txt does not imply 
that any of this checking needs to be done and lists possible actions that 
a userspace oom handler can do upon being notified such as raising a limit 
or killing a process itself.  This is what userspace _expects_ to do when 
notified.

Giving current access to memory reserves so that it may make forward 
progress is something only the kernel can do and is a part of both the VM 
and memcg implementations to allow forward progress to be made.  It is not 
something userspace is involved in.

Additionally, you're not losing any functionality by merging the patch, if 
you really want to know simply when the limit has been reached and not 
something actionable as stated by the memcg documentation, you can do so 
with memory thresholds or VMPRESSURE_CRITICAL.

Google relies on this behavior so that userspace oom handlers can be 
implemented to respond to oom conditions and not cause unnecessary oom 
killing.  We'd like to know why you refuse to provide such an interface in 
a responsible and reliable way.

Thanks.
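
For reference, a minimal sketch of how a userspace handler registers for
these notifications under cgroup v1, following the eventfd steps listed in
Documentation/cgroups/memory.txt (create an eventfd, open
memory.oom_control, write "<event_fd> <fd of memory.oom_control>" to
cgroup.event_control). The helper names format_registration(), open_in(),
register_oom_notifier() and wait_for_oom() are made up for illustration, and
error handling is kept minimal.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Build the "<event_fd> <control_fd>" line; returns its length or -1. */
int format_registration(char *buf, size_t len, int event_fd, int control_fd)
{
	int n = snprintf(buf, len, "%d %d", event_fd, control_fd);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Open "<dir>/<file>"; returns a file descriptor or -1. */
static int open_in(const char *dir, const char *file, int flags)
{
	char path[4096];
	int n = snprintf(path, sizeof(path), "%s/%s", dir, file);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return open(path, flags);
}

/*
 * Register an oom notifier on the memcg mounted at cgroup_dir.
 * Returns the eventfd on success and stores the memory.oom_control fd in
 * *control_fd_out so the caller controls its lifetime; returns -1 on error.
 */
int register_oom_notifier(const char *cgroup_dir, int *control_fd_out)
{
	char line[64];
	int efd, ofd, cfd, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	ofd = open_in(cgroup_dir, "memory.oom_control", O_RDONLY);
	cfd = open_in(cgroup_dir, "cgroup.event_control", O_WRONLY);
	if (ofd < 0 || cfd < 0)
		goto fail;

	n = format_registration(line, sizeof(line), efd, ofd);
	if (n < 0 || write(cfd, line, (size_t)n) != n)
		goto fail;

	close(cfd);
	*control_fd_out = ofd;
	return efd;
fail:
	if (cfd >= 0)
		close(cfd);
	if (ofd >= 0)
		close(ofd);
	close(efd);
	return -1;
}

/* Block until the kernel signals the eventfd; returns 0 on a wakeup. */
int wait_for_oom(int efd)
{
	uint64_t count;

	return read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count) ? 0 : -1;
}
```

Note that a wakeup from wait_for_oom() carries only a counter. Nothing in it
tells the handler whether the notification is actionable, which is the gap
described above.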

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-09 14:30                                                                   ` [PATCH] memcg: Do not hang on OOM when killed by userspace OOM " Michal Hocko
@ 2014-01-09 21:40                                                                     ` David Rientjes
  2014-01-10  8:23                                                                       ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-09 21:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 9 Jan 2014, Michal Hocko wrote:

> Eric has reported that he can see task(s) stuck in memcg OOM handler
> regularly. The only way out is to
> 	echo 0 > $GROUP/memory.oom_control
> His usecase is:
> - Setup a hierarchy with memory and the freezer
>   (disable kernel oom and have a process watch for oom).
> - In that memory cgroup add a process with one thread per cpu.
> - In one thread slowly allocate once per second I think it is 16M of ram
>   and mlock and dirty it (just to force the pages into ram and stay there).
> - When oom is achieved loop:
>   * attempt to freeze all of the tasks.
>   * if frozen send every task SIGKILL, unfreeze, remove the directory in
>     cgroupfs.
> 
> Eric has then pinpointed the issue to be memcg specific.
> 
> All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
> Those that have received fatal signal will bypass the charge and should
> continue on their way out. The tricky part is that the exit path might
> trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
> while its memcg is still under OOM because nobody has released any
> charges yet.
> Unlike with the in-kernel OOM handler the exiting task doesn't get
> TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
> and falls to the memcg OOM again without any way out of it as there are
> no fatal signals pending anymore.
> 
> This patch fixes the issue by checking PF_EXITING early in
> __mem_cgroup_try_charge and bypass the charge same as if it had fatal
> signal pending or TIF_MEMDIE set.
> 
> Normally exiting tasks (i.e. not killed) will bypass the charge now, but
> this should be OK as the task is leaving and will release memory.
> Increasing the memory pressure just to release it a moment later seems
> like a dubious waste of cycles. Besides that, charges after exit_signals
> should be rare.
> 
> Reported-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Is this tested?

> ---
>  mm/memcontrol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b8dfed1b9d87..b86fbb04b7c6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	 * MEMDIE process.
>  	 */
>  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> -		     || fatal_signal_pending(current)))
> +		     || fatal_signal_pending(current))
> +		     || current->flags & PF_EXITING)
>  		goto bypass;
>  
>  	if (unlikely(task_in_memcg_oom(current)))

This would become problematic if a significant amount of memory is charged 
in the exit() path.  I don't know of an egregious amount of memory being 
allocated and charged after PF_EXITING is set, but if it happens in the 
future then this could potentially cause system oom conditions even in 
memcg configurations that are designed such as the one Tejun suggested to 
be able to handle such conditions in userspace:

		     ___root___
		    /	       \
		user		oom
		/  \		/ \
		A  B		C D

where the limit of user is equal to the amount of system memory minus 
whatever amount of memory is needed by the system oom handler attached as 
a descendant of oom and still allows the limits of A + B to exceed the 
limit of user.

So how do we ensure that memory allocations in the exit() path don't cause 
system oom conditions when the above configuration no longer provides 
any strict guarantee?

Thanks.
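
The positions in this thread on when a charge may bypass the memcg can be
compared with a small userspace model. This is a sketch only: struct
task_model and the bypass_*() helpers are hypothetical stand-ins for the
TIF_MEMDIE, fatal_signal_pending(current) and PF_EXITING checks in
__mem_cgroup_try_charge(), not the kernel code itself.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant state of current. */
struct task_model {
	bool tif_memdie;   /* TIF_MEMDIE was set by the oom killer */
	bool fatal_signal; /* a fatal signal (SIGKILL) is still pending */
	bool pf_exiting;   /* PF_EXITING is set in current->flags */
};

/* Behavior before the patch: oom victims and killed tasks bypass. */
bool bypass_before_patch(const struct task_model *t)
{
	return t->tif_memdie || t->fatal_signal;
}

/* Behavior with the patch: exiting tasks bypass as well. */
bool bypass_with_patch(const struct task_model *t)
{
	return t->tif_memdie || t->fatal_signal || t->pf_exiting;
}

/* The stricter position: bypass only when TIF_MEMDIE is set. */
bool bypass_memdie_only(const struct task_model *t)
{
	return t->tif_memdie;
}
```

With the first predicate, an exiting task that has no fatal signal pending
anymore gets no bypass, which is the hang described in the changelog. The
second predicate lets it through, and the third would send it back through
the normal charge path.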

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-09 21:34                                                                 ` [patch 1/2] mm, memcg: avoid oom notification when current needs " David Rientjes
@ 2014-01-09 22:47                                                                   ` Andrew Morton
  2014-01-10  0:01                                                                     ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2014-01-09 22:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 9 Jan 2014 13:34:24 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> On Tue, 7 Jan 2014, Andrew Morton wrote:
> 
> > I just spent a happy half hour reliving this thread and ended up
> > deciding I agreed with everyone!  It appears that many more emails are
> > needed so I think I'll drop
> > http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> > for now.
> > 
> > The claim that
> > mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> > will impact existing userspace seems a bit dubious to me.
> > 
> 
> I'm not sure why this was dropped since it's vitally needed for any sane 
> userspace oom handler to be effective.

It was dropped because the other memcg developers disagreed with it.

I'd really prefer not to have to spend a great amount of time parsing
argumentative and repetitive emails to make a tie-break decision which
may well be wrong anyway.

Please work with the other guys to find an acceptable implementation. 
There must be *something* we can do?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-09 22:47                                                                   ` Andrew Morton
@ 2014-01-10  0:01                                                                     ` David Rientjes
  2014-01-10  0:12                                                                       ` Andrew Morton
  2014-01-10  8:30                                                                       ` Michal Hocko
  0 siblings, 2 replies; 87+ messages in thread
From: David Rientjes @ 2014-01-10  0:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 9 Jan 2014, Andrew Morton wrote:

> > I'm not sure why this was dropped since it's vitally needed for any sane 
> > userspace oom handler to be effective.
> 
> It was dropped because the other memcg developers disagreed with it.
> 

It was acked-by Michal.

> I'd really prefer not to have to spend a great amount of time parsing
> argumentative and repetitive emails to make a tie-break decision which
> may well be wrong anyway.
> 
> Please work with the other guys to find an acceptable implementation. 
> There must be *something* we can do?
> 

We REQUIRE this behavior for a sane userspace oom handler implementation.  
You've snipped my email quite extensively, but I'd like to know 
specifically how you would implement a userspace oom handler described by 
Section 10 of Documentation/cgroups/memory.txt without this patch?

Are you suggesting that userspace is supposed to wait for successive 
wakeups over some arbitrarily defined period of time to determine whether 
memory freeing (i.e. a process in the exit() path or with a pending 
SIGKILL making forward progress to free its memory) can be done or whether 
it needs to do something to free memory?  If not, how else is userspace 
supposed to know that it should act?

How do you prevent unnecessary oom killing if the userspace oom handler 
wakes up and kills something concurrent with the process triggering the 
notification getting access to memory reserves, exiting, and freeing its 
memory?  Userspace just killed a process unnecessarily.  This is the exact 
reason why the kernel oom killer doesn't do a damn thing in these 
conditions, because it's NOT ACTIONABLE by the oom killer, a process 
simply needs to exit.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10  0:01                                                                     ` David Rientjes
@ 2014-01-10  0:12                                                                       ` Andrew Morton
  2014-01-10  0:23                                                                         ` David Rientjes
  2014-01-10  8:30                                                                       ` Michal Hocko
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2014-01-10  0:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 9 Jan 2014 16:01:15 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> On Thu, 9 Jan 2014, Andrew Morton wrote:
> 
> > > I'm not sure why this was dropped since it's vitally needed for any sane 
> > > userspace oom handler to be effective.
> > 
> > It was dropped because the other memcg developers disagreed with it.
> > 
> 
> It was acked-by Michal.

And Johannes?

> > I'd really prefer not to have to spend a great amount of time parsing
> > argumentative and repetitive emails to make a tie-break decision which
> > may well be wrong anyway.
> > 
> > Please work with the other guys to find an acceptable implementation. 
> > There must be *something* we can do?
> > 
> 
> We REQUIRE this behavior for a sane userspace oom handler implementation.  
> You've snipped my email quite extensively, but I'd like to know 
> specifically how you would implement a userspace oom handler described by 
> Section 10 of Documentation/cgroups/memory.txt without this patch?

From long experience I know that if I suggest an alternative
implementation, advocates of the initial implementation will invest
great effort in demonstrating why my suggestion won't work while
investing zero effort in thinking up alternatives themselves.

> Are you suggesting that userspace is supposed to wait for successive 
> wakeups over some arbitrarily defined period of time to determine whether 
> memory freeing (i.e. a process in the exit() path or with a pending 
> SIGKILL making forward progress to free its memory) can be done or whether 
> it needs to do something to free memory?  If not, how else is userspace 
> supposed to know that it should act?
> 
> How do you prevent unnecessary oom killing if the userspace oom handler 
> wakes up and kills something concurrent with the process triggering the 
> notification getting access to memory reserves, exiting, and freeing its 
> memory?  Userspace just killed a process unnecessarily.  This is the exact 
> reason why the kernel oom killer doesn't do a damn thing in these 
> conditions, because it's NOT ACTIONABLE by the oom killer, a process 
> simply needs to exit.

So the interface is wrong.  We have two semantically different kernel
states which are being communicated to userspace in the same way, so
userspace cannot disambiguate.

Solution: invent a better communication scheme with a richer payload. 
Use that, deprecate the old interface if poss.

Another solution: add a mode knob to select between alternative kernel
behaviors (yuk).

Another solution: get David to think of a solution which addresses the
issues which others have raised.

Johannes' final email in this thread has yet to be replied to, btw.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10  0:12                                                                       ` Andrew Morton
@ 2014-01-10  0:23                                                                         ` David Rientjes
  2014-01-10  0:35                                                                           ` David Rientjes
  2014-01-10 22:14                                                                           ` Johannes Weiner
  0 siblings, 2 replies; 87+ messages in thread
From: David Rientjes @ 2014-01-10  0:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 9 Jan 2014, Andrew Morton wrote:

> > > It was dropped because the other memcg developers disagreed with it.
> > > 
> > 
> > It was acked-by Michal.
> 
> And Johannes?
> 

Johannes is arguing for the same semantics that VMPRESSURE_CRITICAL and/or 
memory thresholds provide, which disagrees with the list of solutions 
that Documentation/cgroups/memory.txt gives for userspace oom handler 
wakeups and is required for any sane implementation.

> > We REQUIRE this behavior for a sane userspace oom handler implementation.  
> > You've snipped my email quite extensively, but I'd like to know 
> > specifically how you would implement a userspace oom handler described by 
> > Section 10 of Documentation/cgroups/memory.txt without this patch?
> 
> From long experience I know that if I suggest an alternative
> implementation, advocates of the initial implementation will invest
> great effort in demonstrating why my suggestion won't work while
> investing zero effort in thinking up alternatives themselves.
> 

Easy thing to say when you don't suggest an alternative implementation, 
right?

I'm fully aware that I'm the only one in this thread who is charged with 
writing and maintaining userspace oom handlers, so I'm not asking for an 
actual implementation, but rather an answer to the very simple question: 
how does userspace know whether it needs to actually do anything or not 
without this patch?

> So the interface is wrong.  We have two semantically different kernel
> states which are being communicated to userspace in the same way, so
> userspace cannot disambiguate.
> 

We want to notify on one state, which is what is described in 
Documentation/cgroups/memory.txt and works with my patch, and not notify 
on another state which was broken by ME in f9434ad15524 ("memcg: give 
current access to memory reserves if it's trying to die").  Am I allowed 
to fix my own breakage?

Userspace expects to get notified for the reasons listed in the 
documentation, not when the kernel is going to allow memory to be freed 
itself.  You can get notification of oom through vmpressure or memory 
thresholds, memory.oom_control needs to be reserved for situations when 
"something" needs to be done by userspace and as defined by the 
documentation.
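
For reference, a minimal sketch of the registration that the documentation 
describes for cgroup v1: create an eventfd, open memory.oom_control, and 
write "<event_fd> <fd of memory.oom_control>" to cgroup.event_control.  The 
function names oom_notifier_register() and oom_notifier_wait() are 
hypothetical, and the sketch only shows the wakeup; it cannot tell the two 
kernel states apart, which is the point being argued here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: register an eventfd on
 * <cgroup_dir>/memory.oom_control (cgroup v1 memory controller).
 * Returns the eventfd on success, -1 on failure.
 */
static int oom_notifier_register(const char *cgroup_dir)
{
	char path[4096], line[64];
	int efd, ofd = -1, cfd = -1, len;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
	ofd = open(path, O_RDONLY);
	if (ofd < 0)
		goto fail;

	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	/* Registration string: "<event_fd> <fd of memory.oom_control>" */
	len = snprintf(line, sizeof(line), "%d %d", efd, ofd);
	if (write(cfd, line, len) != (ssize_t)len)
		goto fail;

	/* The control files are used here only for the registration. */
	close(cfd);
	close(ofd);
	return efd;

fail:
	if (cfd >= 0)
		close(cfd);
	if (ofd >= 0)
		close(ofd);
	close(efd);
	return -1;
}

/* Block until the next oom notification; 0 on wakeup, -1 on error. */
static int oom_notifier_wait(int efd)
{
	uint64_t count;

	return read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count) ? 0 : -1;
}
```

A handler would call oom_notifier_register() once on its memcg directory 
and then loop on oom_notifier_wait(), deciding after each wakeup whether 
anything needs to be done.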

> Solution: invent a better communication scheme with a richer payload. 
> Use that, deprecate the old interface if poss.
> 

There are better communication schemes for oom conditions that are not 
actionable, they are memcg memory threshold notifications and vmpressure.

> Johannes' final email in this thread has yet to be replied to, btw.
> 

Will do.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2013-12-09 23:05                                         ` Johannes Weiner
@ 2014-01-10  0:34                                           ` David Rientjes
  0 siblings, 0 replies; 87+ messages in thread
From: David Rientjes @ 2014-01-10  0:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups

Andrew requested I reply to this email, so it's old, but here it is.


On Mon, 9 Dec 2013, Johannes Weiner wrote:

> We check for fatal signals during the repeated charge attempts and
> reclaim.  Should we be checking for PF_EXITING too?
> 

Michal has proposed that patch, and I question whether we should be doing 
that.  If significant memory allocation can be done in the exit() path 
after PF_EXITING, either now or in the future, then those charges are 
bypassed to root.  That does not allow memory to be set aside for system 
oom handlers under the memcg configuration suggested by Tejun, which 
limits the amount of "user" memory with a top-level memcg limit that can 
be overcommitted below it: the bypasses would prevent the userspace oom 
handlers from getting the memory that has been reserved for them.  In 
other words, if a 64GB machine has top-level memcgs "user" with a limit of 
62GB and "oom" with a limit of 2GB for system oom handlers, that 2GB 
cannot be guaranteed with all of these bypasses (uncharged memory, such as 
unaccounted kernel memory, memory reserves).
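
To put rough numbers on this, a minimal sketch of the arithmetic in C (not 
kernel code).  Only the 64GB/62GB/2GB split comes from the example above; 
the function name oom_reserve_left() and any bypassed amount passed to it 
are hypothetical.

```c
#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)

/*
 * Memory still available to the "oom" memcg once "user" sits at its limit
 * and some charges have bypassed the memcg limits to root.
 * Assumes total >= user_limit.  Returns 0 when the bypassed memory has
 * consumed the whole reserve.
 */
static uint64_t oom_reserve_left(uint64_t total, uint64_t user_limit,
				 uint64_t bypassed)
{
	uint64_t reserve = total - user_limit;	/* 2GB in the example */

	return bypassed >= reserve ? 0 : reserve - bypassed;
}
```

With nothing bypassed the full 2GB is there; every byte charged to root 
instead of "user" comes straight out of it.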

> You even re-inforced this motivation by suggesting the separate memcg
> margin check right before the OOM kill, so don't blame us for
> misunderstanding the exact placement of this check as your main
> argument when you repeated it over and over.
> 

We've talked about a lot of stuff in these threads, yes.

> All I object to is that the OOM killer is riddled with last-second
> checks of whether the OOM situation is still existent.  We establish
> that the context is OOM and once we are certain we are executing,
> period.
> 

This patch moves a check from being "last second" to actually before the 
oom killer is called at all, you should be pleased.

> Not catching PF_EXITING in the long window between the first reclaim
> and going OOM is a separate issue and I can see that this should be
> fixed but it should be checked before we start invoking OOM.
> 

Doesn't seem like an issue with this patch.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10  0:23                                                                         ` David Rientjes
@ 2014-01-10  0:35                                                                           ` David Rientjes
  2014-01-10 22:14                                                                           ` Johannes Weiner
  1 sibling, 0 replies; 87+ messages in thread
From: David Rientjes @ 2014-01-10  0:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 9 Jan 2014, David Rientjes wrote:

> > Johannes' final email in this thread has yet to be replied to, btw.
> > 
> 
> Will do.
> 

I've responded to this email, but nothing in Johannes' email actually 
talks about this specific patch at all, so I'm not sure it's very useful.

Thanks.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-09 21:40                                                                     ` David Rientjes
@ 2014-01-10  8:23                                                                       ` Michal Hocko
  2014-01-10 21:33                                                                         ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-10  8:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu 09-01-14 13:40:10, David Rientjes wrote:
> On Thu, 9 Jan 2014, Michal Hocko wrote:
> 
> > Eric has reported that he can see task(s) stuck in memcg OOM handler
> > regularly. The only way out is to
> > 	echo 0 > $GROUP/memory.oom_control
> > His usecase is:
> > - Setup a hierarchy with memory and the freezer
> >   (disable kernel oom and have a process watch for oom).
> > - In that memory cgroup add a process with one thread per cpu.
> > - In one thread slowly allocate once per second I think it is 16M of ram
> >   and mlock and dirty it (just to force the pages into ram and stay there).
> > - When oom is achieved loop:
> >   * attempt to freeze all of the tasks.
> >   * if frozen send every task SIGKILL, unfreeze, remove the directory in
> >     cgroupfs.
> > 
> > Eric has then pinpointed the issue to be memcg specific.
> > 
> > All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
> > Those that have received fatal signal will bypass the charge and should
> > continue on their way out. The tricky part is that the exit path might
> > trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
> > while its memcg is still under OOM because nobody has released any
> > charges yet.
> > Unlike with the in-kernel OOM handler the exiting task doesn't get
> > TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
> > and falls to the memcg OOM again without any way out of it as there are
> > no fatal signals pending anymore.
> > 
> > This patch fixes the issue by checking PF_EXITING early in
> > __mem_cgroup_try_charge and bypass the charge same as if it had fatal
> > signal pending or TIF_MEMDIE set.
> > 
> > Normally exiting tasks (aka not killed) will bypass the charge now but
> > this should be OK as the task is leaving and will release memory and
> > increasing the memory pressure just to release it in a moment seems
> > dubious wasting of cycles. Besides that charges after exit_signals
> > should be rare.
> > 
> > Reported-by: Eric W. Biederman <ebiederm@xmission.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> Is this tested?

By Eric? No AFAIK. I wasn't able to reproduce the issue myself.

> > ---
> >  mm/memcontrol.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index b8dfed1b9d87..b86fbb04b7c6 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  	 * MEMDIE process.
> >  	 */
> >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > -		     || fatal_signal_pending(current)))
> > +		     || fatal_signal_pending(current))
> > +		     || current->flags & PF_EXITING)
> >  		goto bypass;
> >  
> >  	if (unlikely(task_in_memcg_oom(current)))
> 
> This would become problematic if significant amount of memory is charged 
> in the exit() path. 

But this would hurt also for fatal_signal_pending tasks, wouldn't it?
Besides that I do not see any source of allocation after exit_signals.

> I don't know of an egregious amount of memory being 
> allocated and charged after PF_EXITING is set, but if it happens in the 
> future then this could potentially cause system oom conditions even in 
> memcg configurations 

Even if that happens then the global OOM killer would give the exiting
task access to memory reserves and wouldn't kill anything else.

So I am not sure what problem you see exactly.

Besides that allocating egregious amount of memory after exit_signals
sounds fundamentally broken to me.

> that are designed such as the one Tejun suggested to 
> be able to handle such conditions in userspace:
> 
> 		     ___root___
> 		    /	       \
> 		user		oom
> 		/  \		/ \
> 		A  B		C D
> 
> where the limit of user is equal to the amount of system memory minus 
> whatever amount of memory is needed by the system oom handler attached as 
> a descendant of oom and still allows the limits of A + B to exceed the 
> limit of user.
> 
> So how do we ensure that memory allocations in the exit() path don't cause 
> system oom conditions whereas the above configuration no longer provides 
> any strict guarantee?
> 
> Thanks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10  0:01                                                                     ` David Rientjes
  2014-01-10  0:12                                                                       ` Andrew Morton
@ 2014-01-10  8:30                                                                       ` Michal Hocko
  2014-01-10 21:38                                                                         ` David Rientjes
  1 sibling, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-10  8:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu 09-01-14 16:01:15, David Rientjes wrote:
> On Thu, 9 Jan 2014, Andrew Morton wrote:
> 
> > > I'm not sure why this was dropped since it's vitally needed for any sane 
> > > userspace oom handler to be effective.
> > 
> > It was dropped because the other memcg developers disagreed with it.
> > 
> 
> It was acked-by Michal.

I have already explained why I have acked it. I will not repeat
it here again. I have also proposed an alternative solution
(https://lkml.org/lkml/2013/12/12/174) which IMO is more viable because
it handles both user/kernel memcg OOM consistently. This patch still has
to be discussed because of Johannes' other concerns. I plan to repost it
in the near future.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-10  8:23                                                                       ` Michal Hocko
@ 2014-01-10 21:33                                                                         ` David Rientjes
  2014-01-15 14:26                                                                           ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-10 21:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Fri, 10 Jan 2014, Michal Hocko wrote:

> > > ---
> > >  mm/memcontrol.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > >  	 * MEMDIE process.
> > >  	 */
> > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > -		     || fatal_signal_pending(current)))
> > > +		     || fatal_signal_pending(current))
> > > +		     || current->flags & PF_EXITING)
> > >  		goto bypass;
> > >  
> > >  	if (unlikely(task_in_memcg_oom(current)))
> > 
> > This would become problematic if significant amount of memory is charged 
> > in the exit() path. 
> 
> But this would hurt also for fatal_signal_pending tasks, wouldn't it?

Yes, and as I've said twice now, that should be removed.  These bypasses 
should be given to one thread and one thread only, which would be the oom 
killed thread if it needs access to memory reserves to either allocate 
memory or charge memory.
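
The two policies can be contrasted with a minimal userspace model in C.  
This is not the kernel code: struct task_model and both function names are 
hypothetical, and the fields merely stand in for TIF_MEMDIE, 
fatal_signal_pending(current) and PF_EXITING.

```c
#include <stdbool.h>

/* Hypothetical model of the per-task state the charge path looks at. */
struct task_model {
	bool tif_memdie;		/* task was chosen by the oom killer */
	bool fatal_signal_pending;	/* e.g. SIGKILL is queued */
	bool pf_exiting;		/* task is in the exit() path */
};

/* Bypass as in the patch quoted above: any of the three states. */
static bool bypass_proposed(const struct task_model *t)
{
	return t->tif_memdie || t->fatal_signal_pending || t->pf_exiting;
}

/* Bypass restricted to the oom killed thread only. */
static bool bypass_memdie_only(const struct task_model *t)
{
	return t->tif_memdie;
}
```

A task that is merely exiting bypasses the charge under the first 
predicate but not under the second.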

If you are suggesting we use the "user" and "oom" top-level memcg 
hierarchy for allowing memory to be available for userspace system oom 
handlers, then this has become important when in the past it may have been 
a minor point.

> Besides that I do not see any source of allocation after exit_signals.
> 

That's fine for today but may not be in the future.  If memory allocation 
is done after PF_EXITING in the future, are people going to check memcg 
bypasses?  No.  And now we have additional memory bypass to root that will 
cause our userspace system oom handlers to be oom themselves with the 
suggested configuration.

Using the "user" and "oom" top-level memcg hierarchy is a double edged 
sword, we must attempt to prevent all of these bypasses as much as 
possible.  The only relevant bypass here is for TIF_MEMDIE which would be 
set if necessary for the one thread that needs it.

> > I don't know of an egregious amount of memory being 
> > allocated and charged after PF_EXITING is set, but if it happens in the 
> > future then this could potentially cause system oom conditions even in 
> > memcg configurations 
> 
> Even if that happens then the global OOM killer would give the exiting
> task access to memory reserves and wouldn't kill anything else.
> 
> So I am not sure what problem you see exactly.
> 

Userspace system oom handlers being able to handle memcg oom conditions in 
the top-level "user" memcg as proposed by Tejun.  If the global oom killer 
becomes a part of that discussion at all, then the userspace system oom 
handler never got a chance to handle the "user" oom.

> Besides that allocating egregious amount of memory after exit_signals
> sounds fundamentally broken to me.
> 

Egregious could be defined as allocating a few bytes multiplied by 
thousands of threads in PF_EXITING.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10  8:30                                                                       ` Michal Hocko
@ 2014-01-10 21:38                                                                         ` David Rientjes
  2014-01-10 22:34                                                                           ` Johannes Weiner
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-10 21:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Fri, 10 Jan 2014, Michal Hocko wrote:

> I have already explained why I have acked it. I will not repeat
> it here again. I have also proposed an alternative solution
> (https://lkml.org/lkml/2013/12/12/174) which IMO is more viable because
> it handles both user/kernel memcg OOM consistently. This patch still has
> to be discussed because of Johannes' other concerns. I plan to repost it
> in the near future.
> 

This three ring circus has to end.  Really.

Your patch, which is partially based on my suggestion to move the 
mem_cgroup_oom_notify() and call it from two places to support both 
memory.oom_control == 1 and != 1, is something that I liked as you know.  
It's based on my patch which is now removed from -mm.  So if you want to 
rebase that patch and propose it, that's great, but this is yet another 
occurrence of where important patches have been yanked out just before the 
merge window when the problem they are fixing is real and we depend on 
them.

Please post your rebased patch ASAP for the 3.14 merge window.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10  0:23                                                                         ` David Rientjes
  2014-01-10  0:35                                                                           ` David Rientjes
@ 2014-01-10 22:14                                                                           ` Johannes Weiner
  2014-01-12 22:10                                                                             ` David Rientjes
  1 sibling, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2014-01-10 22:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, Jan 09, 2014 at 04:23:50PM -0800, David Rientjes wrote:
> On Thu, 9 Jan 2014, Andrew Morton wrote:
> 
> > > > It was dropped because the other memcg developers disagreed with it.
> > > > 
> > > 
> > > It was acked-by Michal.

Michal acked it before we had most of the discussions and now he is
proposing an alternate version of yours, a patch that you are even
discussing with him concurrently in another thread.  To claim he is
still backing your patch because of that initial ack is disingenuous.

> > And Johannes?
> > 
> 
> Johannes is arguing for the same semantics that VMPRESSURE_CRITICAL and/or 
> memory thresholds provide, which disagrees with the list of solutions 
> that Documentation/cgroups/memory.txt gives for userspace oom handler 
> wakeups and is required for any sane implementation.

No, he's not and I'm sick of you repeating refuted garbage like this.

You have convinced neither me nor Michal that your problem is entirely
real and when confronted with doubt you just repeat the same points
over and over.

The one aspect of your change that we DO agree is valid is now fixed
by Michal in a separate attempt because you could not be bothered to
incorporate feedback into your patch.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10 21:38                                                                         ` David Rientjes
@ 2014-01-10 22:34                                                                           ` Johannes Weiner
  2014-01-12 22:14                                                                             ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Johannes Weiner @ 2014-01-10 22:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Fri, Jan 10, 2014 at 01:38:50PM -0800, David Rientjes wrote:
> On Fri, 10 Jan 2014, Michal Hocko wrote:
> 
> > I have already explained why I have acked it. I will not repeat
> > it here again. I have also proposed an alternative solution
> > (https://lkml.org/lkml/2013/12/12/174) which IMO is more viable because
> > it handles both user/kernel memcg OOM consistently. This patch still has
> > to be discussed because of Johannes' other concerns. I plan to repost it
> > in the near future.
> > 
> 
> This three ring circus has to end.  Really.
> 
> Your patch, which is partially based on my suggestion to move the 
> mem_cgroup_oom_notify() and call it from two places to support both 
> memory.oom_control == 1 and != 1, is something that I liked as you know.  
> It's based on my patch which is now removed from -mm.  So if you want to 
> rebase that patch and propose it, that's great, but this is yet another 
> occurrence of where important patches have been yanked out just before the 
> merge window when the problem they are fixing is real and we depend on 
> them.

We tried to discuss and understand the problem, yet all we got was
"it's OBVIOUS" and "Google has been using this patch ever since we
switched to memcg" and flat out repetitions of the same points about
reliable OOM notification that were already put into question.

You still have not convinced me that the problem exists as you
described it, apart from the aspects that Michal is now fixing
separately because you did not show any signs of cooperating.

None of this will change until you start working with us and actually
address feedback and inquiries instead of just repeating your talking
points over and over.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10 22:14                                                                           ` Johannes Weiner
@ 2014-01-12 22:10                                                                             ` David Rientjes
  2014-01-15 14:34                                                                               ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-12 22:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Fri, 10 Jan 2014, Johannes Weiner wrote:

> > > > It was acked-by Michal.
> 
> Michal acked it before we had most of the discussions and now he is
> proposing an alternate version of yours, a patch that you are even
> discussing with him concurrently in another thread.  To claim he is
> still backing your patch because of that initial ack is disingenuous.
> 

His patch depends on mine, Johannes.

> > Johannes is arguing for the same semantics that VMPRESSURE_CRITICAL and/or 
> > memory thresholds provide, which disagrees with the list of solutions 
> > that Documentation/cgroups/memory.txt gives for userspace oom handler 
> > wakeups and is required for any sane implementation.
> 
> No, he's not and I'm sick of you repeating refuted garbage like this.
> 
> You have convinced neither me nor Michal that your problem is entirely
> real and when confronted with doubt you just repeat the same points
> over and over.
> 

The conditional to check if current needs access to memory reserves to 
make forward progress and avoid oom killing anything else is done after 
the memcg notification.  It's real per section 6.8.4 of the C99 standard 
which defines how a conditional works.  We do not want a userspace 
notification in such a case because userspace testing of whether the 
condition is actionable would be unreliable.  This is not dead code, it 
does get executed.

> The one aspect of your change that we DO agree is valid is now fixed
> by Michal in a separate attempt because you could not be bothered to
> incorporate feedback into your patch.
> 

I suggested his patch, Johannes, but his patch depends on mine.  I'm 
hoping he can rebase his patch and it's done and merged into -mm before 
the merge window for 3.14 as I've stated.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-10 22:34                                                                           ` Johannes Weiner
@ 2014-01-12 22:14                                                                             ` David Rientjes
  0 siblings, 0 replies; 87+ messages in thread
From: David Rientjes @ 2014-01-12 22:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Fri, 10 Jan 2014, Johannes Weiner wrote:

> > Your patch, which is partially based on my suggestion to move the 
> > mem_cgroup_oom_notify() and call it from two places to support both 
> > memory.oom_control == 1 and != 1, is something that I liked as you know.  
> > It's based on my patch which is now removed from -mm.  So if you want to 
> > rebase that patch and propose it, that's great, but this is yet another 
> > occurrence of where important patches have been yanked out just before the 
> > merge window when the problem they are fixing is real and we depend on 
> > them.
> 
> We tried to discuss and understand the problem, yet all we got was
> "it's OBVIOUS" and "Google has been using this patch ever since we
> switched to memcg" and flat out repetitions of the same points about
> reliable OOM notification that were already put into question.
> 
> You still have not convinced me that the problem exists as you
> described it, apart from the aspects that Michal is now fixing
> separately because you did not show any signs of cooperating.
> 

I cooperated by suggesting his patch which moves the 
mem_cgroup_oom_notify(), Johannes.  The problem is that it depends on my 
patch which was removed from -mm.  He can rebase that patch, but I'm 
hoping it is done before the merge window for inclusion in 3.14.

> None of this will change until you start working with us and actually
> address feedback and inquiries instead of just repeating your talking
> points over and over.
> 

I worked with Michal, who acked my patch, and then wrote another patch on 
top of it based partially on my suggestion, Johannes.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-10 21:33                                                                         ` David Rientjes
@ 2014-01-15 14:26                                                                           ` Michal Hocko
  2014-01-15 21:19                                                                             ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-15 14:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Fri 10-01-14 13:33:01, David Rientjes wrote:
> On Fri, 10 Jan 2014, Michal Hocko wrote:
> 
> > > > ---
> > > >  mm/memcontrol.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > >  	 * MEMDIE process.
> > > >  	 */
> > > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > -		     || fatal_signal_pending(current)))
> > > > +		     || fatal_signal_pending(current))
> > > > +		     || current->flags & PF_EXITING)
> > > >  		goto bypass;
> > > >  
> > > >  	if (unlikely(task_in_memcg_oom(current)))
> > > 
> > > This would become problematic if significant amount of memory is charged 
> > > in the exit() path. 
> > 
> > But this would hurt also for fatal_signal_pending tasks, wouldn't it?
> 
> Yes, and as I've said twice now, that should be removed. 

And you failed to provide any relevant data to back your suggestions. I
have told you that we have had these heuristics for ages and we need a
strong justification to drop them. So if you really think that they are
not appropriate then back your statements with real data.

E.g. measure how much memory we are talking about.

> These bypasses should be given to one thread and one thread only,
> which would be the oom killed thread if it needs access to memory
> reserves to either allocate memory or charge memory.

There is no way to determine whether a task has been killed due to user
space OOM killer or by a regular kill.

> If you are suggesting we use the "user" and "oom" top-level memcg 
> hierarchy for allowing memory to be available for userspace system oom 
> handlers, then this has become important when in the past it may have been 
> a minor point.

I am not sure it would be _that_ important and if that really turns out to
be the case then we should deal with it. So far I haven't seen any
evidence there is a lot of memory charged on the exit path.

> > Besides that I do not see any source of allocation after exit_signals.
> > 
> 
> That's fine for today but may not be in the future.  If memory allocation 
> is done after PF_EXITING in the future, are people going to check memcg 
> bypasses?  No.  And now we have additional memory bypass to root that will 
> cause our userspace system oom handlers to be oom themselves with the 
> suggested configuration.
> 
> Using the "user" and "oom" top-level memcg hierarchy is a double edged 
> sword, we must attempt to prevent all of these bypasses as much as 
> possible.  The only relevant bypass here is for TIF_MEMDIE which would be 
> set if necessary for the one thread that needs it.

TIF_MEMDIE doesn't work for userspace OOM killers. So we cannot rely on
this flag currently.

> > > I don't know of an egregious amount of memory being 
> > > allocated and charged after PF_EXITING is set, but if it happens in the 
> > > future then this could potentially cause system oom conditions even in 
> > > memcg configurations 
> > 
> > Even if that happens then the global OOM killer would give the exiting
> > task access to memory reserves and wouldn't kill anything else.
> > 
> > So I am not sure what problem you see exactly.
> > 
> 
> Userspace system oom handlers being able to handle memcg oom conditions in 
> the top-level "user" memcg as proposed by Tejun.  If the global oom killer 
> becomes a part of that discussion at all, then the userspace system oom 
> handler never got a chance to handle the "user" oom.
> 
> > Besides that allocating egregious amount of memory after exit_signals
> > sounds fundamentally broken to me.
> > 
> 
> Egregious could be defined as allocating a few bytes multiplied by 
> thousands of threads in PF_EXITING.

Does this happen in real life?

Look, I have no objections to making the OOM handling better but it would
help a lot to build new heuristics based on some data in hand. I tried
to repeat that again and again but it seems to not help. I do not want
to end up with new sets of heuristics that break other stuff just because
they made sense in the context of the specific usecase.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-12 22:10                                                                             ` David Rientjes
@ 2014-01-15 14:34                                                                               ` Michal Hocko
  2014-01-15 21:23                                                                                 ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-15 14:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Sun 12-01-14 14:10:49, David Rientjes wrote:
> On Fri, 10 Jan 2014, Johannes Weiner wrote:
> 
> > > > > It was acked-by Michal.
> > 
> > Michal acked it before we had most of the discussions and now he is
> > proposing an alternate version of yours, a patch that you are even
> > discussing with him concurrently in another thread.  To claim he is
> > still backing your patch because of that initial ack is disingenuous.
> > 
> 
> His patch depends on mine, Johannes.

Does it? Are we talking about the same patch here?
https://lkml.org/lkml/2013/12/12/174

Which depends on yours only to revert your part. I plan to repost it but
that still doesn't mean it will get merged because Johannes still has
some arguments against. I would like to start the discussion again
because now we are so deep in circles that it is hard to come up with a
reasonable outcome. It is still hard to e.g. agree on an actual fix
for a real problem https://lkml.org/lkml/2013/12/12/129.

While notification might be an issue as well it is more of a corner case
than a regular one. So let's try to move on, agree on the "oom vs.
PF_EXITING" issue first and lay out the discussion for the notification
in a new thread. Shall we?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-15 14:26                                                                           ` Michal Hocko
@ 2014-01-15 21:19                                                                             ` David Rientjes
  2014-01-16 10:12                                                                               ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-15 21:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Wed, 15 Jan 2014, Michal Hocko wrote:

> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > > >  	 * MEMDIE process.
> > > > >  	 */
> > > > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > > -		     || fatal_signal_pending(current)))
> > > > > +		     || fatal_signal_pending(current))
> > > > > +		     || current->flags & PF_EXITING)
> > > > >  		goto bypass;
> > > > >  
> > > > >  	if (unlikely(task_in_memcg_oom(current)))
> > > > 
> > > > This would become problematic if significant amount of memory is charged 
> > > > in the exit() path. 
> > > 
> > > But this would hurt also for fatal_signal_pending tasks, wouldn't it?
> > 
> > Yes, and as I've said twice now, that should be removed. 
> 
> And you failed to provide any relevant data to back your suggestions. I
> have told you that we have these heuristics for ages and we need a
> strong justification to drop them. So if you really think that they are
> not appropriate then back your statements with real data.
> 
> E.g. measure how much memory are we talking about.
> 

The heuristic may have existed for ages, but the proposed memcg 
configuration for preserving memory such that userspace oom handlers may 
run such as

			 _____root______
			/		\
		    user		 oom
		   /	\		/   \
		   A	B	 	a   b

where user/memory.limit_in_bytes == [amount of present RAM] + 
oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
problematic, including Johannes' buggy bypass for charges in memcgs with 
pending memcgs that has since been fixed after I identified it.  This 
bypass is included.  Processes attached to "a" and "b" are userspace oom 
handlers for processes attached to "A" and "B", respectively.

The amount of memory you're talking about is proportional to the number of 
processes that have pending SIGKILLs (and now those with PF_EXITING set), 
the former of which is obviously more concerning since they could be 
charging memory at any point in the kernel that would succeed.  The latter 
is concerning only if future memory allocation post-PF_EXITING would 
become significant and nobody is going to think about oom memcg bypasses 
in such a case.
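
To illustrate why the bypassed amount scales with the number of such
processes, here is a simplified userspace model.  The task_model struct
and both helpers are hypothetical names assumed for this sketch; the three
conditions mirror the quoted hunk of __mem_cgroup_try_charge(), but this
is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task; not the kernel's task_struct. */
struct task_model {
	bool tif_memdie;		/* selected by the oom killer */
	bool fatal_signal_pending;	/* e.g. a pending SIGKILL */
	bool pf_exiting;		/* already in the exit path */
	size_t pending_charge;		/* bytes the task is about to charge */
};

/* True if the charge would skip the memcg limit, per the quoted hunk. */
static bool charge_bypasses_limit(const struct task_model *t)
{
	return t->tif_memdie || t->fatal_signal_pending || t->pf_exiting;
}

/* Sum of the charges that would land in the root memcg instead. */
static size_t total_bypassed(const struct task_model *tasks, size_t n)
{
	size_t total = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (charge_bypasses_limit(&tasks[i]))
			total += tasks[i].pending_charge;
	return total;
}
```

Every additional task with a pending fatal signal or PF_EXITING set adds
its own pending charge to the total, which is the proportionality meant
above.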

To use the configuration suggested above, we need to prevent as many 
bypasses as possible to the root memcg.  Otherwise, the memory protected 
for the "oom" memcg from processes constrained by the limit of "user" is 
no longer protected.  This isn't only a problem with the bypasses here in 
the charging path, but also unaccounted kernel memory, for example.

For this to be usable, we need to ensure that the limit of the "oom" memcg 
is protected for the userspace oom handlers that are attached.  With a 
charge bypassed to the root memcg greater than or equal to the limit of 
the "oom" memcg OR cumulative charges bypassed to the root memcg greater 
than or equal to the limit of the "oom" memcg by processes with pending 
SIGKILLs, userspace oom handlers cannot respond.  That's particularly 
dangerous without a memcg oom kill delay, as proposed before, since 
userspace must disable oom killing entirely for both "A" and "B" for 
userspace notification to be meaningful, since all processes are now 
livelocked.
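
The condition in the previous paragraph can be written down as a small
check.  This is a sketch under assumptions: oom_handlers_starved() is a
hypothetical helper, the array stands for individual charges bypassed to
the root memcg, and oom_limit stands for oom/memory.limit_in_bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true once the charges bypassed to the root memcg reach the
 * limit of the "oom" memcg.  A single charge >= oom_limit is covered by
 * the same test, since it also makes the running total reach the limit.
 */
static bool oom_handlers_starved(const size_t *bypassed, size_t n,
				 size_t oom_limit)
{
	size_t cumulative = 0;

	for (size_t i = 0; i < n; i++) {
		/* cumulative < oom_limit holds here, so no underflow */
		if (bypassed[i] >= oom_limit - cumulative)
			return true;
		cumulative += bypassed[i];
	}
	return false;
}
```

Written this way, the "single charge" case and the "cumulative charges"
case collapse into one comparison against the running total.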

> > These bypasses should be given to one thread and one thread only,
> > which would be the oom killed thread if it needs access to memory
> > reserves to either allocate memory or charge memory.
> 
> There is no way to determine whether a task has been killed due to user
> space OOM killer or by a regular kill.
> 

I'm referring to only granting TIF_MEMDIE to a single process in any memcg 
hierarchy at or below the memcg that has encountered its limit to avoid 
granting it to many processes and bypassing their charges to the root 
memcg; the same variation of the above code, but going through the memcg 
oom killer to get TIF_MEMDIE first.  We must be vigilant and only grant 
TIF_MEMDIE for the process that shall exit.

> > If you are suggesting we use the "user" and "oom" top-level memcg 
> > hierarchy for allowing memory to be available for userspace system oom 
> > handlers, then this has become important when in the past it may have been 
> > a minor point.
> 
> I am not sure it would be _that_ important and if that really becomes to
> be the case then we should deal with it. So far I haven't see any
> evidence there is a lot of memory charged on the exit path.
> 

I'm debating both fatal_signal_pending() and PF_EXITING here since they 
are now both bypasses, we need to remove fatal_signal_pending().  My 
simple question with your patch: how do you guarantee memory to processes 
attached to "a" and "b"?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-15 14:34                                                                               ` Michal Hocko
@ 2014-01-15 21:23                                                                                 ` David Rientjes
  2014-01-16  9:32                                                                                   ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-15 21:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Wed, 15 Jan 2014, Michal Hocko wrote:

> > > > > > It was acked-by Michal.
> > > 
> > > Michal acked it before we had most of the discussions and now he is
> > > proposing an alternate version of yours, a patch that you are even
> > > discussing with him concurrently in another thread.  To claim he is
> > > still backing your patch because of that initial ack is disingenuous.
> > > 
> > 
> > His patch depends on mine, Johannes.
> 
> Does it? Are we talking about the same patch here?
> https://lkml.org/lkml/2013/12/12/174
> 

I'm happy with either patch, I suggested doing the mem_cgroup_oom_notify() 
at the last minute only when actually killing a process because of your 
concern that the oom killer would still defer.  That was addressing your 
concern as an extension of my patch which avoids unconditionally giving 
current access to memory reserves without scanning or deferring anything.  
I would be happy with either approach, and so I don't see why removing my 
patch from -mm which yours is based on would be needed.

> Which depends on yours only to revert your part. I plan to repost it but
> that still doesn't mean it will get merged because Johannes still has
> some arguments against. I would like to start the discussion again
> because now we are so deep in circles that it is hard to come up with a
> reasonable outcome. It is still hard to e.g. agree on an actual fix
> for a real problem https://lkml.org/lkml/2013/12/12/129.
> 

This is concerning because it's merged in -mm without being tested by Eric 
and is marked for stable while violating the stable kernel rules criteria.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-15 21:23                                                                                 ` David Rientjes
@ 2014-01-16  9:32                                                                                   ` Michal Hocko
  2014-01-21  5:58                                                                                     ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-16  9:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Wed 15-01-14 13:23:10, David Rientjes wrote:
> On Wed, 15 Jan 2014, Michal Hocko wrote:
[...]
> > Which depends on yours only to revert your part. I plan to repost it but
> > that still doesn't mean it will get merged because Johannes still has
> > some arguments against. I would like to start the discussion again
> > because now we are so deep in circles that it is hard to come up with a
> > reasonable outcome. It is still hard to e.g. agree on an actual fix
> > for a real problem https://lkml.org/lkml/2013/12/12/129.
> > 
> 
> This is concerning because it's merged in -mm without being tested by Eric 
> and is marked for stable while violating the stable kernel rules criteria.

Are you questioning whether the patch fixes the described issue?

Please note that the exit_robust_list and PF_EXITING as a culprit has
been identified by Eric. Of course I would prefer if it was tested by
anybody who can reproduce it.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-15 21:19                                                                             ` David Rientjes
@ 2014-01-16 10:12                                                                               ` Michal Hocko
  2014-01-21  6:13                                                                                 ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Michal Hocko @ 2014-01-16 10:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Wed 15-01-14 13:19:21, David Rientjes wrote:
> On Wed, 15 Jan 2014, Michal Hocko wrote:
> 
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index b8dfed1b9d87..b86fbb04b7c6 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > > > >  	 * MEMDIE process.
> > > > > >  	 */
> > > > > >  	if (unlikely(test_thread_flag(TIF_MEMDIE)
> > > > > > -		     || fatal_signal_pending(current)))
> > > > > > +		     || fatal_signal_pending(current))
> > > > > > +		     || current->flags & PF_EXITING)
> > > > > >  		goto bypass;
> > > > > >  
> > > > > >  	if (unlikely(task_in_memcg_oom(current)))
> > > > > 
> > > > > This would become problematic if significant amount of memory is charged 
> > > > > in the exit() path. 
> > > > 
> > > > But this would hurt also for fatal_signal_pending tasks, wouldn't it?
> > > 
> > > Yes, and as I've said twice now, that should be removed. 
> > 
> > And you failed to provide any relevant data to back your suggestions. I
> > have told you that we have these heuristics for ages and we need a
> > strong justification to drop them. So if you really think that they are
> > not appropriate then back your statements with real data.
> > 
> > E.g. measure how much memory are we talking about.
> 
> The heuristic may have existed for ages, but the proposed memcg 
> configuration for preserving memory such that userspace oom handlers may 
> run such as
> 
> 			 _____root______
> 			/		\
> 		    user		 oom
> 		   /	\		/   \
> 		   A	B	 	a   b
> 
> where user/memory.limit_in_bytes == [amount of present RAM] + 
> oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
> problematic, including Johannes' buggy bypass for charges in memcgs with 
> pending memcgs that has since been fixed after I identified it.  This 
> bypass is included.  Processes attached to "a" and "b" are userspace oom 
> handlers for processes attached to "A" and "B", respectively.
> 
> The amount of memory you're talking about is proportional to the number of 
> processes that have pending SIGKILLs (and now those with PF_EXITING set), 
> the former of which is obviously more concerning since they could be 
> charging memory at any point in the kernel that would succeed. 

I understand your concerns. Yes, excessive charges might be dangerous. I
haven't dismissed that when you mentioned it earlier. I am just
repeatedly asking how much memory we are talking about, how real the
issue is and what all the other consequences are. And for some reason you
are not providing that information (or maybe I am just not seeing that
in your responses) and that is why we are stuck in a circle.

For example your some_fudge needs to consider memory which is not
accounted for. That is rather hard to predict because it depends
on drivers (or whoever calls page allocator directly) you have and
the current load. Also all tasks living in the root memcg are not
accounted. So you would need to have some pillow to be safe. Do
fatal_signal_pending tasks allocate much more than your pillow?
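
A minimal sketch of that sizing question, assuming the unaccounted
consumers can be estimated as three byte counts.  The struct, its field
names and pillow_is_sufficient() are hypothetical and only model the
categories named above; how to measure them is exactly the open question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical estimate of memory that is not charged to "user". */
struct unaccounted_model {
	size_t direct_page_alloc;	/* drivers etc. calling the page allocator */
	size_t root_memcg_tasks;	/* tasks living in the root memcg */
	size_t bypassed_charges;	/* charges of fatal_signal_pending tasks */
};

/* True if the pillow (some_fudge) covers all unaccounted consumers. */
static bool pillow_is_sufficient(size_t pillow,
				 const struct unaccounted_model *u)
{
	const size_t demands[] = {
		u->direct_page_alloc,
		u->root_memcg_tasks,
		u->bypassed_charges,
	};

	/* Subtract one demand at a time so the sum cannot overflow. */
	for (size_t i = 0; i < sizeof(demands) / sizeof(demands[0]); i++) {
		if (demands[i] > pillow)
			return false;
		pillow -= demands[i];
	}
	return true;
}
```

The bypassed charges are only one of three terms here, so removing the
bypass does not remove the need for a pillow.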

I am also not sure whether a failing charge for a fatal_signal_pending
task (without going via OOM, which would set TIF_MEMDIE) could prevent the
task from exiting.

Finally there is still a risk of regressions when a killed task causes a
pointless reclaim just to free a few pages for a task which will free
memory a few seconds later.

> The latter 
> is concerning only if future memory allocation post-PF_EXITING would 
> become significant and nobody is going to think about oom memcg bypasses 
> in such a case.
> 
> To use the configuration suggested above, we need to prevent as many 
> bypasses as possible to the root memcg.  Otherwise, the memory protected 
> for the "oom" memcg from processes constrained by the limit of "user" is 
> no longer protected.  This isn't only a problem with the bypasses here in 
> the charging path, but also unaccounted kernel memory, for example.

Yes, and apart from GFP_NOFAIL we allow bypasses only for tasks that
should terminate in a short time. I think that having a setup with a
guarantee of never triggering the global OOM is too ambitious and I am
even skeptical it would be achievable.

> For this to be usable, we need to ensure that the limit of the "oom" memcg 
> is protected for the userspace oom handlers that are attached.  With a 
> charge bypassed to the root memcg greater than or equal to the limit of 
> the "oom" memcg OR cumulative charges bypassed to the root memcg greater 
> than or equal to the limit of the "oom" memcg by processes with pending 
> SIGKILLs, userspace oom handlers cannot respond.  That's particularly 
> dangerous without a memcg oom kill delay, as proposed before, since 
> userspace must disable oom killing entirely for both "A" and "B" for 
> userspace notification to be meaningful, since all processes are now 
> livelocked.
> 
> > > These bypasses should be given to one thread and one thread only,
> > > which would be the oom killed thread if it needs access to memory
> > > reserves to either allocate memory or charge memory.
> > 
> > There is no way to determine whether a task has been killed due to user
> > space OOM killer or by a regular kill.
> > 
> 
> I'm referring to only granting TIF_MEMDIE to a single process in any memcg 
> hierarchy at or below the memcg that has encountered its limit to avoid 

The point was that TIF_MEMDIE handling for user OOM killed tasks is
tricky if implementable at all. If you check for fatal_signal_pending
after schedule returns in mem_cgroup_oom_synchronize then you might get
access to other fatal_signal_pending tasks as well.

> granting it to many processes and bypassing their charges to the root 
> memcg; the same variation of the above code, but going through the memcg 
> oom killer to get TIF_MEMDIE first.  We must be vigilant and only grant 
> TIF_MEMDIE for the process that shall exit.
>
> > > If you are suggesting we use the "user" and "oom" top-level memcg 
> > > hierarchy for allowing memory to be available for userspace system oom 
> > > handlers, then this has become important when in the past it may have been 
> > > a minor point.
> > 
> > I am not sure it would be _that_ important and if that really becomes to
> > be the case then we should deal with it. So far I haven't see any
> > evidence there is a lot of memory charged on the exit path.
> > 
> 
> I'm debating both fatal_signal_pending() and PF_EXITING here since they 
> are now both bypasses, we need to remove fatal_signal_pending().  My 
> simple question with your patch: how do you guarantee memory to processes 
> attached to "a" and "b"?

The only way you can get that _guarantee_ is to account all the memory
allocations. And that is not implemented and I would even question
whether it is worthwhile. So we still have to live with the possibility
of triggering the global OOM killer. That's why I believe we need to be
able to tell the kernel what the user policy for the oom killer is (that
is a different discussion though).

So it all boils down to having a sufficient pillow when your above
hierarchy is configured. You would need it even without the
fatal_signal_pending resp. PF_EXITING bypasses. If we want to get rid of
any bypass heuristic then we have to show that the pillow would be
unreasonably high to handle regular loads or that the code path which
triggers that is not fixable.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-16  9:32                                                                                   ` Michal Hocko
@ 2014-01-21  5:58                                                                                     ` David Rientjes
  2014-01-21  6:04                                                                                       ` Greg Kroah-Hartmann
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-21  5:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman, Greg Kroah-Hartmann

On Thu, 16 Jan 2014, Michal Hocko wrote:

> > This is concerning because it's merged in -mm without being tested by Eric 
> > and is marked for stable while violating the stable kernel rules criteria.
> 
> Are you questioning whether the patch fixes the described issue?
> 
> Please note that the exit_robust_list and PF_EXITING as a culprit has
> been identified by Eric. Of course I would prefer if it was tested by
> anybody who can reproduce it.

You're saying the patch hasn't been tested by anybody and that clearly 
violates the first rule in Documentation/stable_kernel_rules.txt:

 - It must be obviously correct and tested.

Adding Greg to the cc if this should be clarified further.  The patches 
getting proposed through -mm for stable boggle my mind sometimes.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-21  5:58                                                                                     ` David Rientjes
@ 2014-01-21  6:04                                                                                       ` Greg Kroah-Hartmann
  2014-01-21  6:08                                                                                         ` David Rientjes
  0 siblings, 1 reply; 87+ messages in thread
From: Greg Kroah-Hartmann @ 2014-01-21  6:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, cgroups, Eric W. Biederman

On Mon, Jan 20, 2014 at 09:58:28PM -0800, David Rientjes wrote:
> The patches getting proposed through -mm for stable boggle my mind
> sometimes.

Do you have any objections to patches that I have taken for -stable?  If
so, please let me know.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
  2014-01-21  6:04                                                                                       ` Greg Kroah-Hartmann
@ 2014-01-21  6:08                                                                                         ` David Rientjes
  0 siblings, 0 replies; 87+ messages in thread
From: David Rientjes @ 2014-01-21  6:08 UTC (permalink / raw)
  To: Greg Kroah-Hartmann
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	linux-kernel, linux-mm, cgroups, Eric W. Biederman

On Mon, 20 Jan 2014, Greg Kroah-Hartmann wrote:

> > The patches getting proposed through -mm for stable boggle my mind
> > sometimes.
> 
> Do you have any objections to patches that I have taken for -stable?  If
> so, please let me know.
> 

You haven't taken the ones that I outlined in 
http://marc.info/?l=linux-kernel&m=138580717728759, so I'm happy that 
those could be prevented.  I'm identifying another patch here that is 
pending in -mm that obviously violates the stable kernel rules and I don't 
believe it should be annotated in a way that you'll scoop it up later.

The patch in question hasn't been tested by anybody and I don't think you 
want such things to ever be merged into a stable kernel series.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-16 10:12                                                                               ` Michal Hocko
@ 2014-01-21  6:13                                                                                 ` David Rientjes
  2014-01-21 13:21                                                                                   ` Michal Hocko
  0 siblings, 1 reply; 87+ messages in thread
From: David Rientjes @ 2014-01-21  6:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Thu, 16 Jan 2014, Michal Hocko wrote:

> > The heuristic may have existed for ages, but the proposed memcg 
> > configuration for preserving memory such that userspace oom handlers may 
> > run such as
> > 
> > 			 _____root______
> > 			/		\
> > 		    user		 oom
> > 		   /	\		/   \
> > 		   A	B	 	a   b
> > 
> > where user/memory.limit_in_bytes == [amount of present RAM] - 
> > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
> > problematic, including Johannes' buggy bypass for charges in memcgs with 
> > pending memcgs that has since been fixed after I identified it.  This 
> > bypass is included.  Processes attached to "a" and "b" are userspace oom 
> > handlers for processes attached to "A" and "B", respectively.
> > 
> > The amount of memory you're talking about is proportional to the number of 
> > processes that have pending SIGKILLs (and now those with PF_EXITING set), 
> > the former of which is obviously more concerning since they could be 
> > charging memory at any point in the kernel that would succeed. 
> 
> I understand your concerns. Yes, excessive charges might be dangerous. I
> haven't dismissed that when you mentioned it earlier. I am just
> repeatedly asking how much memory are we talking about, how real is the
> issue and what are all the other consequences. And for some reason you
> are not providing that information (or maybe I am just not seeing that
> in your responses) and that is why we are stuck in circle.
> 

Wtf are you talking about?  You're adding a bypass in this patch and then 
you're asking me to go and see how much memory it could potentially bypass 
and take away from oom handlers under the above memcg configuration?  This 
seems like something you should provide before throwing out patches that 
nobody has tested if you want to make the argument that the above memcg 
configuration is valid for handling userspace oom notifications.
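
As a rough illustration of the configuration under discussion, the limit
arithmetic can be sketched in userspace C.  This is a minimal sketch, not
kernel code: it assumes the "user" limit is meant to leave room for the
"oom" hierarchy (present RAM minus the oom limit minus a fudge factor), and
the helper name and parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: compute a limit for the "user"
 * memcg so that "user" plus "oom" fit in present RAM with some slack.
 * Returns 0 if the requested reservation does not fit in RAM.
 */
static uint64_t user_limit_bytes(uint64_t ram_bytes, uint64_t oom_limit_bytes,
				 uint64_t fudge_bytes)
{
	if (oom_limit_bytes > ram_bytes ||
	    fudge_bytes > ram_bytes - oom_limit_bytes)
		return 0;
	return ram_bytes - oom_limit_bytes - fudge_bytes;
}
```

The result would be written to user/memory.limit_in_bytes; anything charged
to the root memcg instead of "user" is not covered by this calculation.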

And you certainly have dismissed what I've mentioned earlier when I said 
that anybody can add memory allocation to the exit path later on and 
nobody is going to think about how much memory this is going to bypass to 
the root memcg and potentially take away from userspace oom handlers.

There are two possible ways to move this forward:

 - avoid bypass to the root memcg in every possible case such that the
   above memcg configuration actually makes a guarantee to userspace oom
   handlers attached to it, or

 - provide per-memcg memory reserves such that userspace oom handlers can
   allocate and charge memory without the above memcg configuration so 
   there is a guarantee.

What's not acceptable, now or ever, is suggesting a solution to a problem 
that is supposed to guarantee some resource and then allow under some 
circumstances that resource to be completely depleted such that the 
solution never works.

> Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
> should terminate in a short time. I think that having a setup with a
> guarantee of never triggering the global OOM is too ambitious and I am
> even skeptical it would be achievable.
> 

"Short time" is meaningless if the memory allocation causes memory to not 
be available to userspace oom handlers.  If allocations are allowed to be 
charged because you're in the exit() path or because you have SIGKILL, 
that can result in a system oom condition that would prevent userspace 
from being able to handle them.
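
The concern can be put in numbers with a small model.  This is only a
sketch under stated assumptions: the "user" hierarchy is at its limit, each
killed or exiting task bypasses a fixed number of bytes to the root memcg,
and the function and parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: memory left for userspace oom
 * handlers once "user" is at its limit and killed_tasks tasks have each
 * bypassed bypass_per_task_bytes to the root memcg.  Arithmetic overflow
 * is ignored for brevity.
 */
static uint64_t handler_headroom_bytes(uint64_t ram_bytes,
				       uint64_t user_limit,
				       uint64_t killed_tasks,
				       uint64_t bypass_per_task_bytes)
{
	uint64_t used = user_limit + killed_tasks * bypass_per_task_bytes;

	return used >= ram_bytes ? 0 : ram_bytes - used;
}
```

In this model the headroom shrinks linearly with the number of tasks that
are allowed to bypass, and reaches zero regardless of how short-lived each
task is.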

> > I'm debating both fatal_signal_pending() and PF_EXITING here since they 
> > are now both bypasses, we need to remove fatal_signal_pending().  My 
> > simple question with your patch: how do you guarantee memory to processes 
> > attached to "a" and "b"?
> 
> The only way you can get that _guarantee_ is to account all the memory
> allocations. And that is not implemented and I would even question
> whether it is worthwhile. So we still have to live with a possibility
> of triggering the global OOM killer. That's why I believe we need to be
> able to tell the kernel what is the user policy for oom killer (that is
> a different discussion though).
> 

So you're saying that Tejun's suggested userspace oom handler 
configuration is pointless, correct?  We can certainly provide a guarantee 
if memory is reserved specifically for userspace oom handling like I 
proposed, the same way that memory reserves are guaranteed for oom killed 
processes.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves
  2014-01-21  6:13                                                                                 ` David Rientjes
@ 2014-01-21 13:21                                                                                   ` Michal Hocko
  0 siblings, 0 replies; 87+ messages in thread
From: Michal Hocko @ 2014-01-21 13:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm, cgroups, Eric W. Biederman

On Mon 20-01-14 22:13:21, David Rientjes wrote:
> On Thu, 16 Jan 2014, Michal Hocko wrote:
> 
> > > The heuristic may have existed for ages, but the proposed memcg 
> > > configuration for preserving memory such that userspace oom handlers may 
> > > run such as
> > > 
> > > 			 _____root______
> > > 			/		\
> > > 		    user		 oom
> > > 		   /	\		/   \
> > > 		   A	B	 	a   b
> > > 
> > > where user/memory.limit_in_bytes == [amount of present RAM] - 
> > > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
> > > problematic, including Johannes' buggy bypass for charges in memcgs with 
> > > pending memcgs that has since been fixed after I identified it.  This 
> > > bypass is included.  Processes attached to "a" and "b" are userspace oom 
> > > handlers for processes attached to "A" and "B", respectively.
> > > 
> > > The amount of memory you're talking about is proportional to the number of 
> > > processes that have pending SIGKILLs (and now those with PF_EXITING set), 
> > > the former of which is obviously more concerning since they could be 
> > > charging memory at any point in the kernel that would succeed. 
> > 
> > I understand your concerns. Yes, excessive charges might be dangerous. I
> > haven't dismissed that when you mentioned it earlier. I am just
> > repeatedly asking how much memory are we talking about, how real is the
> > issue and what are all the other consequences. And for some reason you
> > are not providing that information (or maybe I am just not seeing that
> > in your responses) and that is why we are stuck in circle.
> > 
> 
> Wtf are you talking about?  You're adding a bypass in this patch and then 
> you're asking me to go and see how much memory it could potentially bypass 
> and take away from oom handlers under the above memcg configuration?

No. You are mixing two things. One of them is adding PF_EXITING bypass
while the other is removing fatal_signal_pending bypass.

The first one is a subset of the latter and it doesn't add an excessive
amount of charges because there are no direct allocations after
exit_signals. You haven't shown that this is not true and your only
concern was that this might change in future. Besides that my argument
was that even if such an allocation led to the global OOM the task would
be given TIF_MEMDIE and nothing would be killed.

The other part is fatal_signal_pending which we have there for ages
and you want to remove it. In order to do that I am asking you for
some data backing up that removal. You keep repeating your arguments
but they lack data or at least show code paths which would wildly
allocate&charge after task has been killed which wouldn't be fixable by
fatal_signal_pending check in the caller to show that the issue is real.
Besides that you are completely ignoring other concerns I have
mentioned, e.g. possible performance regressions when a pointless
reclaim slows existing tasks.

Please try to understand that this is not Black&White thing.

> This seems like something you should provide before throwing out
> patches that nobody has tested if you want to make the argument that
> the above memcg configuration is valid for handling userspace oom
> notifications.
> 
> And you certainly have dismissed what I've mentioned earlier when I said 
> that anybody can add memory allocation to the exit path later on and 
> nobody is going to think about how much memory this is going to bypass to 
> the root memcg and potentially take away from userspace oom handlers.

If this happens then it has to be fixed and if not fixable then
reconsider this heuristic.

> There are two possible ways to move this forward:
> 
>  - avoid bypass to the root memcg in every possible case such that the
>    above memcg configuration actually makes a guarantee to userspace oom
>    handlers attached to it, or
> 
>  - provide per-memcg memory reserves such that userspace oom handlers can
>    allocate and charge memory without the above memcg configuration so 
>    there is a guarantee.

David, you are aware that there are memory allocations that are out of
memcg/kmem scope, aren't you? This means that whether you add memcg
charge-reserves or give memcg OOM killers access to memory reserves,
you still can never rule out the global OOM killer.

> What's not acceptable, now or ever, is suggesting a solution to a problem 
> that is supposed to guarantee some resource and then allow under some 
> circumstances that resource to be completely depleted such that the 
> solution never works.

And yet you still haven't shown that such depletion is real. E.g. g-u-p
backs off when it sees a fatal signal pending; other callers that allocate
charged memory should do the same.
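
The back-off pattern referred to here can be modelled in userspace C.  This
is a sketch only: struct model_task and its flag are hypothetical stand-ins
for the kernel's task state and fatal_signal_pending(), not the real
interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; the kernel's task_struct differs. */
struct model_task {
	bool fatal_signal_pending;	/* models fatal_signal_pending(current) */
	size_t charged_pages;		/* pages charged so far */
};

/*
 * Charge up to nr_pages one page at a time, rechecking the flag before
 * every charge so that a killed task stops charging immediately.
 * Returns the number of pages actually charged by this call.
 */
static size_t charge_with_backoff(struct model_task *t, size_t nr_pages)
{
	size_t done = 0;

	while (done < nr_pages) {
		if (t->fatal_signal_pending)
			break;		/* back off instead of charging more */
		t->charged_pages++;
		done++;
	}
	return done;
}
```

With the check in the caller, what a killed task can still charge is bounded
by the work done between two checks rather than by the size of the request.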

> > Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
> > should terminate in a short time. I think that having a setup with a
> > guarantee of never triggering the global OOM is too ambitious and I am
> > even skeptical it would be achievable.
> > 
> 
> "Short time" is meaningless if the memory allocation causes memory to not 
> be available to userspace oom handlers.  If allocations are allowed to be 
> charged because you're in the exit() path or because you have SIGKILL, 
> that can result in a system oom condition that would prevent userspace 
> from being able to handle them.

And you cannot prevent that until _all_ memory allocations are charged,
which is not the case.

> > > I'm debating both fatal_signal_pending() and PF_EXITING here since they 
> > > are now both bypasses, we need to remove fatal_signal_pending().  My 
> > > simple question with your patch: how do you guarantee memory to processes 
> > > attached to "a" and "b"?
> > 
> > The only way you can get that _guarantee_ is to account all the memory
> > allocations. And that is not implemented and I would even question
> > whether it is worthwhile. So we still have to live with a possibility
> > of triggering the global OOM killer. That's why I believe we need to be
> > able to tell the kernel what is the user policy for oom killer (that is
> > a different discussion though).
> > 
> 
> So you're saying that Tejun's suggested userspace oom handler 
> configuration is pointless, correct?

No, I am not saying that. I am just saying that you cannot rule out
the global OOM killer. You have to tune your memory pillow based on
your workload what-ever approach you end up using until all the memory
(including every single in-kernel caller of the page allocator) is
accounted by memcg.

And I am still not convinced that fatal_signal_pending bypass is a major
factor here. I consider direct users of page allocator a much bigger
problem.

> We can certainly provide a guarantee if memory is reserved
> specifically for userspace oom handling like I proposed, the same way
> that memory reserves are guaranteed for oom killed processes.

No it's not! Because giving oom handlers access to memory reserves works
only until reserves are depleted as well. We can have many oom handlers
running in parallel and no guarantee on how much each of them can
allocate/charge. So you are back to square one.
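
A small model makes the depletion argument concrete.  It is a sketch under
the assumption that all handlers draw from one shared reserve in arrival
order; the function name and the request sizes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: n oom handlers draw from a single
 * shared reserve in order.  Returns how many of them received their full
 * request before the reserve ran out.
 */
static size_t handlers_satisfied(uint64_t reserve_bytes,
				 const uint64_t *requests, size_t n)
{
	size_t ok = 0;

	for (size_t i = 0; i < n; i++) {
		if (requests[i] > reserve_bytes)
			continue;	/* this handler cannot make progress */
		reserve_bytes -= requests[i];
		ok++;
	}
	return ok;
}
```

Without a bound on what each handler may allocate, the number of satisfied
handlers depends entirely on the request sizes, which is the lack of
guarantee described above.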

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2014-01-21 13:21 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-31  1:39 [patch] mm, memcg: add memory.oom_control notification for system oom David Rientjes
2013-10-31  5:49 ` Johannes Weiner
2013-11-13 22:19   ` David Rientjes
2013-11-13 23:34     ` Johannes Weiner
2013-11-14  0:56       ` David Rientjes
2013-11-14  3:25         ` Johannes Weiner
2013-11-14 22:57           ` David Rientjes
2013-11-14 23:26             ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves David Rientjes
2013-11-14 23:26               ` [patch 2/2] mm, memcg: add memory.oom_control notification for system oom David Rientjes
2013-11-18 18:52                 ` Michal Hocko
2013-11-19  1:25                   ` David Rientjes
2013-11-19 12:41                     ` Michal Hocko
2013-11-18 12:52               ` [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves Michal Hocko
2013-11-18 12:55                 ` Michal Hocko
2013-11-19  1:19                   ` David Rientjes
2013-11-18 15:41               ` Johannes Weiner
2013-11-18 16:51                 ` Michal Hocko
2013-11-19  1:22                   ` David Rientjes
2013-11-22 16:51                   ` Johannes Weiner
2013-11-27  0:53                     ` David Rientjes
2013-11-27 16:34                       ` Johannes Weiner
2013-11-27 21:51                         ` David Rientjes
2013-11-27 23:19                           ` Johannes Weiner
2013-11-28  0:22                             ` David Rientjes
2013-11-28  2:28                               ` Johannes Weiner
2013-11-28  2:52                                 ` David Rientjes
2013-11-28  3:16                                   ` Johannes Weiner
2013-12-02 20:02                         ` Michal Hocko
2013-12-02 21:25                           ` Johannes Weiner
2013-12-03 12:04                             ` Michal Hocko
2013-12-03 20:17                               ` Johannes Weiner
2013-12-03 21:00                                 ` Michal Hocko
2013-12-03 21:23                                   ` Johannes Weiner
2013-12-03 23:50                               ` David Rientjes
2013-12-04  3:34                                 ` Johannes Weiner
2013-12-04 11:13                                 ` Michal Hocko
2013-12-05  0:23                                   ` David Rientjes
2013-12-09 12:48                                     ` Michal Hocko
2013-12-09 21:46                                       ` David Rientjes
2013-12-09 22:51                                         ` Johannes Weiner
2013-12-09 23:05                                         ` Johannes Weiner
2014-01-10  0:34                                           ` David Rientjes
2013-12-10 10:38                                         ` Michal Hocko
2013-12-11  1:03                                           ` David Rientjes
2013-12-11  9:55                                             ` Michal Hocko
2013-12-11 22:40                                               ` David Rientjes
2013-12-12 10:31                                                 ` Michal Hocko
2013-12-12 10:50                                                   ` Michal Hocko
2013-12-12 12:11                                                   ` Michal Hocko
2013-12-12 12:37                                                     ` Michal Hocko
2013-12-13 23:55                                                   ` David Rientjes
2013-12-17 16:23                                                     ` Michal Hocko
2013-12-17 20:50                                                       ` David Rientjes
2013-12-18 20:04                                                         ` Michal Hocko
2013-12-19  6:09                                                           ` David Rientjes
2013-12-19 14:41                                                             ` Michal Hocko
2014-01-08  0:25                                                               ` Andrew Morton
2014-01-08 10:33                                                                 ` Michal Hocko
2014-01-09 14:30                                                                   ` [PATCH] memcg: Do not hang on OOM when killed by userspace OOM " Michal Hocko
2014-01-09 21:40                                                                     ` David Rientjes
2014-01-10  8:23                                                                       ` Michal Hocko
2014-01-10 21:33                                                                         ` David Rientjes
2014-01-15 14:26                                                                           ` Michal Hocko
2014-01-15 21:19                                                                             ` David Rientjes
2014-01-16 10:12                                                                               ` Michal Hocko
2014-01-21  6:13                                                                                 ` David Rientjes
2014-01-21 13:21                                                                                   ` Michal Hocko
2014-01-09 21:34                                                                 ` [patch 1/2] mm, memcg: avoid oom notification when current needs " David Rientjes
2014-01-09 22:47                                                                   ` Andrew Morton
2014-01-10  0:01                                                                     ` David Rientjes
2014-01-10  0:12                                                                       ` Andrew Morton
2014-01-10  0:23                                                                         ` David Rientjes
2014-01-10  0:35                                                                           ` David Rientjes
2014-01-10 22:14                                                                           ` Johannes Weiner
2014-01-12 22:10                                                                             ` David Rientjes
2014-01-15 14:34                                                                               ` Michal Hocko
2014-01-15 21:23                                                                                 ` David Rientjes
2014-01-16  9:32                                                                                   ` Michal Hocko
2014-01-21  5:58                                                                                     ` David Rientjes
2014-01-21  6:04                                                                                       ` Greg Kroah-Hartman
2014-01-21  6:08                                                                                         ` David Rientjes
2014-01-10  8:30                                                                       ` Michal Hocko
2014-01-10 21:38                                                                         ` David Rientjes
2014-01-10 22:34                                                                           ` Johannes Weiner
2014-01-12 22:14                                                                             ` David Rientjes
2013-11-18 15:54             ` [patch] mm, memcg: add memory.oom_control notification for system oom Johannes Weiner
2013-11-18 23:15               ` One Thousand Gnomes
