* [patch] memcg: add oom killer delay
@ 2011-02-08  0:24 David Rientjes
  2011-02-08  1:55 ` KAMEZAWA Hiroyuki
  2011-02-09 22:19 ` David Rientjes
  0 siblings, 2 replies; 58+ messages in thread
From: David Rientjes @ 2011-02-08  0:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

Completely disabling the oom killer for a memcg is problematic if
userspace is unable to address the condition itself, usually because it
is unresponsive.  This scenario creates a memcg deadlock: tasks are
sitting in TASK_KILLABLE waiting for the limit to be increased, a task to
exit or move, or the oom killer to be reenabled, and userspace is unable
to do so.

An additional possible use case is to defer oom killing within a memcg
for a set period of time, probably to prevent unnecessary kills due to
temporary memory spikes, before allowing the kernel to handle the
condition.

This patch adds an oom killer delay so that a memcg may be configured to
wait at least a pre-defined number of milliseconds before calling the oom
killer.  If the oom condition persists for this number of milliseconds,
the oom killer will be called the next time the memory controller
attempts to charge a page (and memory.oom_control is set to 0).  This
allows userspace to have a short period of time to respond to the
condition before deferring to the kernel to kill a task.

Admins may set the oom killer delay using the new interface:

	# echo 60000 > memory.oom_delay_millisecs

This will defer oom killing to the kernel only after 60 seconds has
elapsed by putting the task to sleep for 60 seconds.  When setting
memory.oom_delay_millisecs, all pending delays have their charges retried
and, if necessary, the new delay is then enforced.

The delay is cleared the first time the memcg is oom to avoid unnecessary
waiting when userspace is unresponsive for future oom conditions.  It may
be set again using the above interface to enforce a delay on the next
oom.

When a memory.oom_delay_millisecs is set for a cgroup, it is propagated
to all children memcg as well and is inherited when a new memcg is
created.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroups/memory.txt |   28 ++++++++++++++++++++++
 mm/memcontrol.c                  |   48 ++++++++++++++++++++++++++++++++++---
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -68,6 +68,7 @@ Brief summary of control files.
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
  memory.oom_control		 # set/show oom controls.
+ memory.oom_delay_millisecs	 # set/show millisecs to wait before oom kill
 
 1. History
 
@@ -640,6 +641,33 @@ At reading, current status of OOM is shown.
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
+It is also possible to configure an oom killer timeout to prevent the
+possibility that the memcg will deadlock looking for memory if userspace
+has disabled the oom killer with oom_control but cannot act to fix the
+condition itself (usually because userspace has become unresponsive).
+
+To set an oom killer timeout for a memcg, write the number of milliseconds
+to wait before killing a task to memory.oom_delay_millisecs:
+
+	# echo 60000 > memory.oom_delay_millisecs	# 60 seconds before kill
+
+This timeout is reset the first time the memcg is oom to prevent needlessly
+waiting for the next oom when userspace is truly unresponsive.  It may be
+set again using the above interface to defer killing a task the next time
+the memcg is oom.
+
+Disabling the oom killer for a memcg with memory.oom_control takes
+precedence over memory.oom_delay_millisecs, so it must be set to 0
+(default) to allow the oom kill after the delay has expired.
+
+This value is inherited from the memcg's parent on creation.  Setting
+a delay for a memcg sets the same delay for all children, as well.
+
+There is no delay if memory.oom_delay_millisecs is set to 0 (default).
+This tunable's upper bound is MAX_SCHEDULE_TIMEOUT (about 24 days on
+32-bit and a lifetime on 64-bit).
+
+
 11. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,6 +239,8 @@ struct mem_cgroup {
 	unsigned int	swappiness;
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
+	/* number of ticks to stall before calling oom killer */
+	int		oom_delay;
 
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
@@ -1541,10 +1543,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem)
 /*
  * try to call OOM killer. returns false if we should exit memory-reclaim loop.
  */
-bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 {
 	struct oom_wait_info owait;
 	bool locked, need_to_kill;
+	long timeout = MAX_SCHEDULE_TIMEOUT;
 
 	owait.mem = mem;
 	owait.wait.flags = 0;
@@ -1563,15 +1566,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
 	if (!locked || mem->oom_kill_disable)
 		need_to_kill = false;
-	if (locked)
+	if (locked) {
+		if (mem->oom_delay) {
+			need_to_kill = false;
+			timeout = mem->oom_delay;
+			mem->oom_delay = 0;
+		}
 		mem_cgroup_oom_notify(mem);
+	}
 	mutex_unlock(&memcg_oom_mutex);
 
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(mem, mask);
 	} else {
-		schedule();
+		schedule_timeout(timeout);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
@@ -1582,7 +1591,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
 		return false;
 	/* Give chance to dying process */
-	schedule_timeout(1);
+	if (timeout == MAX_SCHEDULE_TIMEOUT)
+		schedule_timeout(1);
 	return true;
 }
 
@@ -4168,6 +4178,30 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }
 
+static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp,
+					struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return jiffies_to_msecs(memcg->oom_delay);
+}
+
+static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
+					struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *iter;
+
+	if (val > MAX_SCHEDULE_TIMEOUT)
+		return -EINVAL;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		iter->oom_delay = msecs_to_jiffies(val);
+		memcg_oom_recover(iter);
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4231,6 +4265,11 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "oom_delay_millisecs",
+		.read_u64 = mem_cgroup_oom_delay_millisecs_read,
+		.write_u64 = mem_cgroup_oom_delay_millisecs_write,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4469,6 +4508,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
 		mem->oom_kill_disable = parent->oom_kill_disable;
+		mem->oom_delay = parent->oom_delay;
 	}
 
 	if (parent && parent->use_hierarchy) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [patch] memcg: add oom killer delay
  2011-02-08  0:24 [patch] memcg: add oom killer delay David Rientjes
@ 2011-02-08  1:55 ` KAMEZAWA Hiroyuki
  2011-02-08  2:13   ` David Rientjes
  2011-02-09 22:19 ` David Rientjes
  1 sibling, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-08  1:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Feb 2011 16:24:08 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> [snip]
>
> +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
> +					struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct mem_cgroup *iter;
> +
> +	if (val > MAX_SCHEDULE_TIMEOUT)
> +		return -EINVAL;
> +
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		iter->oom_delay = msecs_to_jiffies(val);
> +		memcg_oom_recover(iter);
> +	}
> +	return 0;

Seems nicer, and it seems you try to update all the children cgroups.

BTW, with the above code, with the following hierarchy,

    A
   /
  B  
 /
C

When a user sets oom_delay in the order A->B->C, A, B, and C can have 'different' numbers.
When a user sets oom_delay in the order C->B->A, A, B, and C will have the same numbers.

This interface seems magical, or broken.

So, my recommendation is 'just allow setting the value on a cgroup which has no children/parent'.
Or 'just allow setting the value on a cgroup which is the root of a hierarchy'.
Could you add a check?  Inheritance at mkdir() is okay to me.

Thanks,
-Kame



* Re: [patch] memcg: add oom killer delay
  2011-02-08  1:55 ` KAMEZAWA Hiroyuki
@ 2011-02-08  2:13   ` David Rientjes
  2011-02-08  2:13     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-02-08  2:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Feb 2011, KAMEZAWA Hiroyuki wrote:

> > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
> > +					struct cftype *cft, u64 val)
> > +{
> > +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +	struct mem_cgroup *iter;
> > +
> > +	if (val > MAX_SCHEDULE_TIMEOUT)
> > +		return -EINVAL;
> > +
> > +	for_each_mem_cgroup_tree(iter, memcg) {
> > +		iter->oom_delay = msecs_to_jiffies(val);
> > +		memcg_oom_recover(iter);
> > +	}
> > +	return 0;
> 
> Seems nicer, and it seems you try to update all the children cgroups.
> 
> BTW, with the above code, with the following hierarchy,
> 
>     A
>    /
>   B  
>  /
> C
> 
> When a user sets oom_delay in the order A->B->C, A, B, and C can have 'different' numbers.
> When a user sets oom_delay in the order C->B->A, A, B, and C will have the same numbers.
> 
> This interface seems magical, or broken.
> 

It's not really magical, it just means that if you change the delay for a 
memcg that you do so for all of its children implicitly as well.

An alternative would be to ensure that a child memcg may never have a 
delay greater than the delay of its parent.  Would you prefer that 
instead?

> So, my recommendation is 'just allow setting the value on a cgroup which has no children/parent'.
> Or 'just allow setting the value on a cgroup which is the root of a hierarchy'.
> Could you add a check?  Inheritance at mkdir() is okay to me.
> 

I'm trying to get away from this only because it doesn't seem very logical 
that creating a child memcg within a parent means that the parent is now 
locked out of setting memory.oom_delay_millisecs.


* Re: [patch] memcg: add oom killer delay
  2011-02-08  2:13   ` David Rientjes
@ 2011-02-08  2:13     ` KAMEZAWA Hiroyuki
  2011-02-08  2:20       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-08  2:13 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Feb 2011 18:13:22 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 8 Feb 2011, KAMEZAWA Hiroyuki wrote:
> 
> > > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
> > > +					struct cftype *cft, u64 val)
> > > +{
> > > +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > > +	struct mem_cgroup *iter;
> > > +
> > > +	if (val > MAX_SCHEDULE_TIMEOUT)
> > > +		return -EINVAL;
> > > +
> > > +	for_each_mem_cgroup_tree(iter, memcg) {
> > > +		iter->oom_delay = msecs_to_jiffies(val);
> > > +		memcg_oom_recover(iter);
> > > +	}
> > > +	return 0;
> > 
> > Seems nicer, and it seems you try to update all the children cgroups.
> > 
> > BTW, with the above code, with the following hierarchy,
> > 
> >     A
> >    /
> >   B  
> >  /
> > C
> > 
> > When a user sets oom_delay in the order A->B->C, A, B, and C can have 'different' numbers.
> > When a user sets oom_delay in the order C->B->A, A, B, and C will have the same numbers.
> > 
> > This interface seems magical, or broken.
> > 
> 
> It's not really magical, it just means that if you change the delay for a 
> memcg that you do so for all of its children implicitly as well.
> 
But you didn't explain the behavior in Documentation.

> An alternative would be to ensure that a child memcg may never have a 
> delay greater than the delay of its parent.  Would you prefer that 
> instead?
> 
I don't think such a limitation makes sense.

My point is just that this behavior is very special among the current memcg interfaces.
They only allow setting an attribute on a cgroup which has no children/no parent.

So, please explain the above behavior in Documentation.

Thanks,
-Kame



* Re: [patch] memcg: add oom killer delay
  2011-02-08  2:13     ` KAMEZAWA Hiroyuki
@ 2011-02-08  2:20       ` KAMEZAWA Hiroyuki
  2011-02-08  2:37         ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-08  2:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Feb 2011 11:13:51 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 7 Feb 2011 18:13:22 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Tue, 8 Feb 2011, KAMEZAWA Hiroyuki wrote:
> > 
> > > > +static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
> > > > +					struct cftype *cft, u64 val)
> > > > +{
> > > > +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > > > +	struct mem_cgroup *iter;
> > > > +
> > > > +	if (val > MAX_SCHEDULE_TIMEOUT)
> > > > +		return -EINVAL;
> > > > +
> > > > +	for_each_mem_cgroup_tree(iter, memcg) {
> > > > +		iter->oom_delay = msecs_to_jiffies(val);
> > > > +		memcg_oom_recover(iter);
> > > > +	}
> > > > +	return 0;
> > > 
> > > Seems nicer, and it seems you try to update all the children cgroups.
> > > 
> > > BTW, with the above code, with the following hierarchy,
> > > 
> > >     A
> > >    /
> > >   B  
> > >  /
> > > C
> > > 
> > > When a user sets oom_delay in the order A->B->C, A, B, and C can have 'different' numbers.
> > > When a user sets oom_delay in the order C->B->A, A, B, and C will have the same numbers.
> > > 
> > > This interface seems magical, or broken.
> > > 
> > 
> > It's not really magical, it just means that if you change the delay for a 
> > memcg that you do so for all of its children implicitly as well.
> > 
> But you didn't explain the behavior in Documentation.
> 
And write this fact:

     A
    /
   B
  /
 C

When 
  A.memory_oom_delay=1sec. 
  B.memory_oom_delay=500msec
  C.memory_oom_delay=200msec

If there is an OOM in group C, C's oom kill will be delayed for 200msec and
a task in group C will be killed.

If there is an OOM in group B, B's oom kill will be delayed for 500msec and
a task in group B or C will be killed.

If there is an OOM in group A, A's oom kill will be delayed for 1sec and
a task in group A, B, or C will be killed.

The oom killer in the hierarchy is serialized by a lock and runs one-by-one
to avoid serial kills.  So, the above delays can be stacked.

Thanks,
-Kame






* Re: [patch] memcg: add oom killer delay
  2011-02-08  2:20       ` KAMEZAWA Hiroyuki
@ 2011-02-08  2:37         ` David Rientjes
  2011-02-08 10:25           ` Balbir Singh
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-02-08  2:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Feb 2011, KAMEZAWA Hiroyuki wrote:

> And write this fact:
> 
>      A
>     /
>    B
>   /
>  C
> 
> When 
>   A.memory_oom_delay=1sec. 
>   B.memory_oom_delay=500msec
>   C.memory_oom_delay=200msec
> 
> If there is an OOM in group C, C's oom kill will be delayed for 200msec and
> a task in group C will be killed.
> 
> If there is an OOM in group B, B's oom kill will be delayed for 500msec and
> a task in group B or C will be killed.
> 
> If there is an OOM in group A, A's oom kill will be delayed for 1sec and
> a task in group A, B, or C will be killed.
> 
> The oom killer in the hierarchy is serialized by a lock and runs one-by-one
> to avoid serial kills.  So, the above delays can be stacked.
> 

Ok, I'll add this to the comment, already added in this version of the patch,
that says changing memory.oom_delay_millisecs does so for all children as
well.

I'll wait a couple days to see if Balbir or Daisuke have any additional 
comments.

Thanks for the review!


* Re: [patch] memcg: add oom killer delay
  2011-02-08  2:37         ` David Rientjes
@ 2011-02-08 10:25           ` Balbir Singh
  0 siblings, 0 replies; 58+ messages in thread
From: Balbir Singh @ 2011-02-08 10:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, linux-mm

* David Rientjes <rientjes@google.com> [2011-02-07 18:37:30]:

> On Tue, 8 Feb 2011, KAMEZAWA Hiroyuki wrote:
> 
> > And write this fact:
> > 
> >      A
> >     /
> >    B
> >   /
> >  C
> > 
> > When 
> >   A.memory_oom_delay=1sec. 
> >   B.memory_oom_delay=500msec
> >   C.memory_oom_delay=200msec
> > 
> > If there is an OOM in group C, C's oom kill will be delayed for 200msec and
> > a task in group C will be killed.
> > 
> > If there is an OOM in group B, B's oom kill will be delayed for 500msec and
> > a task in group B or C will be killed.
> > 
> > If there is an OOM in group A, A's oom kill will be delayed for 1sec and
> > a task in group A, B, or C will be killed.
> > 
> > The oom killer in the hierarchy is serialized by a lock and runs one-by-one
> > to avoid serial kills.  So, the above delays can be stacked.
> > 
> 
> Ok, I'll add this to the comment that says changing 
> memory.oom_delay_millisecs does so for all children as well that was 
> already added in this version of the patch.
> 
> I'll wait a couple days to see if Balbir or Daisuke have any additional 
> comments.
>

The patches look good to me from last time; Kamezawa-San had these comments
last time as well.  I am OK with the changes proposed.

-- 
	Three Cheers,
	Balbir


* [patch] memcg: add oom killer delay
  2011-02-08  0:24 [patch] memcg: add oom killer delay David Rientjes
  2011-02-08  1:55 ` KAMEZAWA Hiroyuki
@ 2011-02-09 22:19 ` David Rientjes
  2011-02-10  0:04   ` KAMEZAWA Hiroyuki
  2011-02-23 23:08   ` Andrew Morton
  1 sibling, 2 replies; 58+ messages in thread
From: David Rientjes @ 2011-02-09 22:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

Completely disabling the oom killer for a memcg is problematic if
userspace is unable to address the condition itself, usually because it
is unresponsive.  This scenario creates a memcg deadlock: tasks are
sitting in TASK_KILLABLE waiting for the limit to be increased, a task to
exit or move, or the oom killer to be reenabled, and userspace is unable
to do so.

An additional possible use case is to defer oom killing within a memcg
for a set period of time, probably to prevent unnecessary kills due to
temporary memory spikes, before allowing the kernel to handle the
condition.

This patch adds an oom killer delay so that a memcg may be configured to
wait at least a pre-defined number of milliseconds before calling the oom
killer.  If the oom condition persists for this number of milliseconds,
the oom killer will be called the next time the memory controller
attempts to charge a page (and memory.oom_control is set to 0).  This
allows userspace to have a short period of time to respond to the
condition before deferring to the kernel to kill a task.

Admins may set the oom killer delay using the new interface:

	# echo 60000 > memory.oom_delay_millisecs

This will defer oom killing to the kernel only after 60 seconds has
elapsed by putting the task to sleep for 60 seconds.  When setting
memory.oom_delay_millisecs, all pending delays have their charges retried
and, if necessary, the new delay is then enforced.

The delay is cleared the first time the memcg is oom to avoid unnecessary
waiting when userspace is unresponsive for future oom conditions.  It may
be set again using the above interface to enforce a delay on the next
oom.

When a memory.oom_delay_millisecs is set for a cgroup, it is propagated
to all children memcg as well and is inherited when a new memcg is
created.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroups/memory.txt |   32 +++++++++++++++++++++++++
 mm/memcontrol.c                  |   48 ++++++++++++++++++++++++++++++++++---
 2 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -68,6 +68,7 @@ Brief summary of control files.
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
  memory.oom_control		 # set/show oom controls.
+ memory.oom_delay_millisecs	 # set/show millisecs to wait before oom kill
 
 1. History
 
@@ -640,6 +641,37 @@ At reading, current status of OOM is shown.
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
+It is also possible to configure an oom killer timeout to prevent the
+possibility that the memcg will deadlock looking for memory if userspace
+has disabled the oom killer with oom_control but cannot act to fix the
+condition itself (usually because userspace has become unresponsive).
+
+To set an oom killer timeout for a memcg, write the number of milliseconds
+to wait before killing a task to memory.oom_delay_millisecs:
+
+	# echo 60000 > memory.oom_delay_millisecs	# 60 seconds before kill
+
+When this memcg is oom, it is guaranteed that this delay will be incurred
+before the kernel kills a task.  The task chosen may either be from this
+memcg or its child memcgs, if any.
+
+This timeout is reset the first time the memcg is oom to prevent needlessly
+waiting for the next oom when userspace is truly unresponsive.  It may be
+set again using the above interface to defer killing a task the next time
+the memcg is oom.
+
+Disabling the oom killer for a memcg with memory.oom_control takes
+precedence over memory.oom_delay_millisecs, so it must be set to 0
+(default) to allow the oom kill after the delay has expired.
+
+This value is inherited from the memcg's parent on creation.  Setting a
+delay for a memcg sets the same delay for all children, as well.
+
+There is no delay if memory.oom_delay_millisecs is set to 0 (default).
+This tunable's upper bound is MAX_SCHEDULE_TIMEOUT (about 24 days on
+32-bit and a lifetime on 64-bit).
+
+
 11. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,6 +239,8 @@ struct mem_cgroup {
 	unsigned int	swappiness;
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
+	/* number of ticks to stall before calling oom killer */
+	int		oom_delay;
 
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
@@ -1541,10 +1543,11 @@ static void memcg_oom_recover(struct mem_cgroup *mem)
 /*
  * try to call OOM killer. returns false if we should exit memory-reclaim loop.
  */
-bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 {
 	struct oom_wait_info owait;
 	bool locked, need_to_kill;
+	long timeout = MAX_SCHEDULE_TIMEOUT;
 
 	owait.mem = mem;
 	owait.wait.flags = 0;
@@ -1563,15 +1566,21 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
 	if (!locked || mem->oom_kill_disable)
 		need_to_kill = false;
-	if (locked)
+	if (locked) {
+		if (mem->oom_delay) {
+			need_to_kill = false;
+			timeout = mem->oom_delay;
+			mem->oom_delay = 0;
+		}
 		mem_cgroup_oom_notify(mem);
+	}
 	mutex_unlock(&memcg_oom_mutex);
 
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(mem, mask);
 	} else {
-		schedule();
+		schedule_timeout(timeout);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
@@ -1582,7 +1591,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
 		return false;
 	/* Give chance to dying process */
-	schedule_timeout(1);
+	if (timeout == MAX_SCHEDULE_TIMEOUT)
+		schedule_timeout(1);
 	return true;
 }
 
@@ -4168,6 +4178,30 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }
 
+static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp,
+					struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return jiffies_to_msecs(memcg->oom_delay);
+}
+
+static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
+					struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *iter;
+
+	if (val > MAX_SCHEDULE_TIMEOUT)
+		return -EINVAL;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		iter->oom_delay = msecs_to_jiffies(val);
+		memcg_oom_recover(iter);
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4231,6 +4265,11 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "oom_delay_millisecs",
+		.read_u64 = mem_cgroup_oom_delay_millisecs_read,
+		.write_u64 = mem_cgroup_oom_delay_millisecs_write,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4469,6 +4508,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
 		mem->oom_kill_disable = parent->oom_kill_disable;
+		mem->oom_delay = parent->oom_delay;
 	}
 
 	if (parent && parent->use_hierarchy) {
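The delay above is deliberately one-shot: mem->oom_delay is copied into timeout and cleared under memcg_oom_mutex, so only the first oom after configuration sleeps. That decision logic can be sketched in a few lines of shell (the variable and function names are invented, purely illustrative of the patch's behavior):

```shell
# oom_delay_ms and oom_kill_disable model memory.oom_delay_millisecs and
# memory.oom_control; handle_oom echoes what the kernel would do on an
# oom event in this memcg.
oom_delay_ms=60000
oom_kill_disable=0

handle_oom() {
    if [ "$oom_kill_disable" = 1 ]; then
        echo "sleep forever"            # wait indefinitely for userspace
    elif [ "$oom_delay_ms" -gt 0 ]; then
        echo "sleep ${oom_delay_ms}ms"  # schedule_timeout(delay)
        oom_delay_ms=0                  # one-shot: cleared on first oom
    else
        echo "kill"                     # mem_cgroup_out_of_memory()
    fi
}

handle_oom   # -> sleep 60000ms   (first oom incurs the delay)
handle_oom   # -> kill            (delay was consumed)
```

Note that a nonzero oom_control wins over any configured delay, matching the documentation hunk above.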

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2011-02-09 22:19 ` David Rientjes
@ 2011-02-10  0:04   ` KAMEZAWA Hiroyuki
  2011-02-16  3:15     ` David Rientjes
  2011-02-23 23:08   ` Andrew Morton
  1 sibling, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-10  0:04 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 9 Feb 2011 14:19:50 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because it
> is unresponsive.  This scenario creates a memcg deadlock: tasks are
> sitting in TASK_KILLABLE waiting for the limit to be increased, a task to
> exit or move, or the oom killer reenabled and userspace is unable to do
> so.
> 
> An additional possible use case is to defer oom killing within a memcg
> for a set period of time, probably to prevent unnecessary kills due to
> temporary memory spikes, before allowing the kernel to handle the
> condition.
> 
> This patch adds an oom killer delay so that a memcg may be configured to
> wait at least a pre-defined number of milliseconds before calling the oom
> killer.  If the oom condition persists for this number of milliseconds,
> the oom killer will be called the next time the memory controller
> attempts to charge a page (and memory.oom_control is set to 0).  This
> allows userspace to have a short period of time to respond to the
> condition before deferring to the kernel to kill a task.
> 
> Admins may set the oom killer delay using the new interface:
> 
> 	# echo 60000 > memory.oom_delay_millisecs
> 
> This will defer oom killing to the kernel only after 60 seconds has
> elapsed by putting the task to sleep for 60 seconds.  When setting
> memory.oom_delay_millisecs, all pending delays have their charges retried
> and, if necessary, the new delay is then enforced.
> 
> The delay is cleared the first time the memcg is oom to avoid unnecessary
> waiting when userspace is unresponsive for future oom conditions.  It may
> be set again using the above interface to enforce a delay on the next
> oom.
> 
> When a memory.oom_delay_millisecs is set for a cgroup, it is propagated
> to all children memcg as well and is inherited when a new memcg is
> created.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hm. But I'm not sure how this will be used.




* Re: [patch] memcg: add oom killer delay
  2011-02-10  0:04   ` KAMEZAWA Hiroyuki
@ 2011-02-16  3:15     ` David Rientjes
  2011-02-20 22:19       ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-02-16  3:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Thu, 10 Feb 2011, KAMEZAWA Hiroyuki wrote:

> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

Thanks!

> Hm. But I'm not sure how this will be used.
> 

Since this patch hasn't been added to -mm even with your acked-by, I'm 
assuming Andrew is waiting for an answer to this :)  I thought it was 
fairly well covered in the changelog, but I'll elaborate:

We can already give userspace a grace period to act before oom killing a 
task by utilizing memory.oom_control.  That's not what the oom killer 
delay addresses, however.  This addresses a very specific (and real) 
problem that occurs when userspace wants that grace period but is unable 
to respond, for whatever reason, to either increase the hard limit or 
allow the oom kill to proceed.  That possibility causes the memcg to 
livelock because no forward progress can be made while oom, which is a 
negative result.  We don't have that possibility with the global oom 
killer since the kernel will always choose to act if memory freeing is 
not imminent.  In other words, since we've opened the window for livelock 
via a kernel feature -- namely memory.oom_control -- that depends on 
userspace remaining responsive, it's only responsible to provide an 
alternate means to configure the cgroup for the same grace period without 
risking livelock.
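For concreteness, the grace-period configuration being argued for would look roughly like the sketch below. The cgroup path is an assumption (adjust CG to wherever the memory controller is mounted); the script falls back to a scratch directory so the writes can be exercised on a machine without a mounted memory controller:

```shell
#!/bin/sh
set -e
# Hypothetical memcg path -- adjust to your memory controller mount.
CG=${CG:-/dev/cgroup/memory/job1}
if [ ! -d "$CG" ]; then
    # No controller mounted here: use a scratch directory standing in
    # for the cgroup so the commands below still run.
    CG=$(mktemp -d)
    : > "$CG/memory.oom_delay_millisecs"
    : > "$CG/memory.oom_control"
fi

echo 60000 > "$CG/memory.oom_delay_millisecs"  # 60s grace period on oom
echo 0 > "$CG/memory.oom_control"              # keep the oom killer enabled;
                                               # disabling it would take
                                               # precedence over the delay
cat "$CG/memory.oom_delay_millisecs"
```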


* Re: [patch] memcg: add oom killer delay
  2011-02-16  3:15     ` David Rientjes
@ 2011-02-20 22:19       ` David Rientjes
  0 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-02-20 22:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 15 Feb 2011, David Rientjes wrote:

> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> 
> Thanks!
> 
> > Hm. But I'm not sure how this will be used.
> > 
> 
> Since this patch hasn't been added to -mm even with your acked-by, I'm 
> assuming Andrew is waiting for an answer to this :)  I thought it was 
> fairly well covered in the changelog, but I'll elaborate:
> 
> We can already give userspace a grace period to act before oom killing a 
> task by utilizing memory.oom_control.  That's not what the oom killer 
> delay addresses, however.  This addresses a very specific (and real) 
> problem that occurs when userspace wants that grace period but is unable 
> to respond, for whatever reason, to either increase the hard limit or 
> allow the oom kill to proceed.  The possibility of that happening would 
> cause that memcg to livelock because no forward progress could be made 
> when oom, which is a negative result.  We don't have that possibility with 
> the global oom killer since the kernel will always choose to act if memory 
> freeing is not imminent: in other words, since we've opened the window for 
> livelock because of an unreliable userspace via a kernel feature -- namely 
> memory.oom_control -- then it's only responsible to provide an alternate 
> means to configure the cgroup for the same grace period without risking 
> livelock.
> 

Andrew, can this be merged in -mm?


* Re: [patch] memcg: add oom killer delay
  2011-02-09 22:19 ` David Rientjes
  2011-02-10  0:04   ` KAMEZAWA Hiroyuki
@ 2011-02-23 23:08   ` Andrew Morton
  2011-02-24  0:13     ` KAMEZAWA Hiroyuki
  2011-02-24  0:51     ` David Rientjes
  1 sibling, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2011-02-23 23:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 9 Feb 2011 14:19:50 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

Your patch still stinks!

If userspace can't handle a disabled oom-killer then userspace
shouldn't have disabled the oom-killer.

How do we fix this properly?

A little birdie tells me that the offending userspace oom handler is
running in a separate memcg and is not itself running out of memory. 
The problem is that the userspace oom handler is also taking peeks into
processes which are in the stressed memcg and is getting stuck on
mmap_sem in the procfs reads.  Correct?

It seems to me that such a userspace oom handler is correctly designed,
and that we should be looking into the reasons why it is unreliable,
and fixing them.  Please tell us about this?

(If fixing the kernel is intractable, wouldn't it be feasible for the
userspace oom handler to have its own watchdog which either starts
killing stuff itself, or which reenables the stressed memcg's
oom-killer?)
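One iteration of the watchdog suggested above might look like the following sketch. The cgroup path and grace period are assumptions; the under_oom line format comes from the memory.oom_control documentation:

```shell
#!/bin/sh
# If the memcg is still under_oom after GRACE seconds, hand the decision
# back to the kernel by re-enabling its oom killer.
CG=${CG:-/dev/cgroup/memory/job1}
GRACE=${GRACE:-60}

watchdog_once() {
    if grep -q 'under_oom 1' "$CG/memory.oom_control" 2>/dev/null; then
        sleep "$GRACE"
        if grep -q 'under_oom 1' "$CG/memory.oom_control" 2>/dev/null; then
            echo 0 > "$CG/memory.oom_control"   # reenable the oom killer
        fi
    fi
}

# A real handler would loop:  while :; do watchdog_once; sleep 1; done
if [ -e "$CG/memory.oom_control" ]; then
    watchdog_once
fi
```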


* Re: [patch] memcg: add oom killer delay
  2011-02-23 23:08   ` Andrew Morton
@ 2011-02-24  0:13     ` KAMEZAWA Hiroyuki
  2011-02-24  0:51     ` David Rientjes
  1 sibling, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-24  0:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 23 Feb 2011 15:08:50 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 9 Feb 2011 14:19:50 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> Your patch still stinks!
> 
> If userspace can't handle a disabled oom-killer then userspace
> shouldn't have disabled the oom-killer.
> 
> How do we fix this properly?
> 
> A little birdie tells me that the offending userspace oom handler is
> running in a separate memcg and is not itself running out of memory. 
> The problem is that the userspace oom handler is also taking peeks into
> processes which are in the stressed memcg and is getting stuck on
> mmap_sem in the procfs reads.  Correct?
> 

Hmm, I think memcg's oom-kill just happens under down_read(mmap_sem),
and all tasks which are under oom will be in the wait-queue.


Thanks,
-Kame


* Re: [patch] memcg: add oom killer delay
  2011-02-23 23:08   ` Andrew Morton
  2011-02-24  0:13     ` KAMEZAWA Hiroyuki
@ 2011-02-24  0:51     ` David Rientjes
  2011-03-03 20:11       ` David Rientjes
  2011-03-03 21:52       ` Andrew Morton
  1 sibling, 2 replies; 58+ messages in thread
From: David Rientjes @ 2011-02-24  0:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 23 Feb 2011, Andrew Morton wrote:

> Your patch still stinks!
> 
> If userspace can't handle a disabled oom-killer then userspace
> shouldn't have disabled the oom-killer.
> 

I agree, but userspace may not always be perfect, especially at large 
scale; we, in kernel land, can easily choose to ignore that, but it's only 
a problem because we're providing an interface where the memcg will 
livelock without userspace intervention.  The global oom killer doesn't 
have this problem, and for years it has even radically panicked the 
machine instead of livelocking, even though other threads, those that are 
OOM_DISABLE, may be getting work done.

This is a memcg-specific issue because memory.oom_control has opened the 
possibility up to livelock that userspace may have no way of correcting on 
its own especially when it may be oom itself.  The natural conclusion is 
that you should never set memory.oom_control unless you can guarantee a 
perfect userspace implementation that will never be unresponsive.  At our 
scale, we can't make that guarantee so memory.oom_control is not helpful 
at all.

If that's the case, then what else do we have at our disposal, other than 
memory.oom_delay_millisecs, that allows us to increase a hard limit or 
kill a job of lower priority, short of setting memory thresholds and 
hoping userspace will schedule and respond before the memcg is completely 
oom?

> How do we fix this properly?
> 
> A little birdie tells me that the offending userspace oom handler is
> running in a separate memcg and is not itself running out of memory. 

It depends on how you configure your memory controllers, but even if it is 
running in a separate memcg how can you make the conclusion it isn't oom 
in parallel?

> The problem is that the userspace oom handler is also taking peeks into
> processes which are in the stressed memcg and is getting stuck on
> mmap_sem in the procfs reads.  Correct?
> 

That's outside the scope of this feature and is a separate discussion; 
this patch specifically addresses an issue where a userspace job scheduler 
wants to take action when a memcg is oom before deferring to the kernel 
and happens to become unresponsive for whatever reason.

> It seems to me that such a userspace oom handler is correctly designed,
> and that we should be looking into the reasons why it is unreliable,
> and fixing them.  Please tell us about this?
> 

The problem isn't specific to any one cause or implementation; we know 
that userspace programs have bugs: they can stall forever in D-state, they 
can be oom themselves, they can get stuck waiting on a lock, and so on.


* Re: [patch] memcg: add oom killer delay
  2011-02-24  0:51     ` David Rientjes
@ 2011-03-03 20:11       ` David Rientjes
  2011-03-03 21:52       ` Andrew Morton
  1 sibling, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-03-03 20:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 23 Feb 2011, David Rientjes wrote:


Was there a response to this, or can this patch be merged?


* Re: [patch] memcg: add oom killer delay
  2011-02-24  0:51     ` David Rientjes
  2011-03-03 20:11       ` David Rientjes
@ 2011-03-03 21:52       ` Andrew Morton
  2011-03-08  0:12         ` David Rientjes
  1 sibling, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2011-03-03 21:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 23 Feb 2011 16:51:24 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> > The problem is that the userspace oom handler is also taking peeks into
> > processes which are in the stressed memcg and is getting stuck on
> > mmap_sem in the procfs reads.  Correct?
> > 
> 
> That's outside the scope of this feature and is a separate discussion; 
> this patch specifically addresses an issue where a userspace job scheduler 
> wants to take action when a memcg is oom before deferring to the kernel 
> and happens to become unresponsive for whatever reason.

That's just handwaving used to justify a workaround for a kernel
deficiency.

If userspace has chosen to replace the oom-killer then userspace should
be able to appropriately perform the role.  But for some
as-yet-undescribed reason, userspace is *not* able to perform that
role.

And I'm suspecting that the as-yet-undescribed reason is a kernel
deficiency.  Spit it out.

> > It seems to me that such a userspace oom handler is correctly designed,
> > and that we should be looking into the reasons why it is unreliable,
> > and fixing them.  Please tell us about this?
> > 
> 
> The problem isn't specific to any one cause or implementation; we know 
> that userspace programs have bugs, they can stall forever in D-state, they 
> can be oom themselves, they can get stuck waiting on a lock, and so on.

It's not the kernel's role to work around userspace bugs and it's
certainly not the kernel's role to work around kernel bugs.

Now please tell us: why is the userspace job manager getting stuck?


* Re: [patch] memcg: add oom killer delay
  2011-03-03 21:52       ` Andrew Morton
@ 2011-03-08  0:12         ` David Rientjes
  2011-03-08  0:29           ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08  0:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Thu, 3 Mar 2011, Andrew Morton wrote:

> If userspace has chosen to replace the oom-killer then userspace should
> be able to appropriately perform the role.  But for some
> as-yet-undescribed reason, userspace is *not* able to perform that
> role.
> 
> And I'm suspecting that the as-yet-undescribed reason is a kernel
> deficiency.  Spit it out.
> 

The purpose of memory.oom_control is to prevent the kernel oom killer from 
killing a task as soon as a memcg reaches its hard limit and reclaim has 
failed.  We want that behavior, but only temporarily, for two reasons:

 - the condition may be temporary and we'd rather busy loop for the 
   duration of the spike in memory usage than kill something off because 
   it will be coming under the hard limit soon or userspace will be 
   increasing that limit (or killing a lower priority job in favor of 
   this one), and

 - it's possible that the userspace daemon is oom itself (being in a 
   separate cgroup doesn't prevent that) and is therefore subject to 
   being killed (or panicking the machine if it is OOM_DISABLE and 
   nothing else is eligible), so it cannot rectify the situation in other 
   memcgs that are also oom.

So this patch is not a bug fix; it's an enhancement to an already existing 
feature (memory.oom_control) that probably should have been coded as a 
timeout in the first place, leaving it up to userspace whether that timeout 
is infinite.
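Concretely, the two configurations differ as follows (the cgroup mount
point is illustrative; memory.oom_delay_millisecs is the file added by
this patch):

```shell
# Existing behavior: disable the oom killer entirely; on oom, tasks sit
# in TASK_KILLABLE until userspace intervenes (possibly forever).
echo 1 > /dev/cgroup/memory/jobs/memory.oom_control

# Proposed behavior: keep the oom killer enabled, but give userspace a
# 60 second grace period after the hard limit is hit before the kernel
# kills anything.
echo 0     > /dev/cgroup/memory/jobs/memory.oom_control
echo 60000 > /dev/cgroup/memory/jobs/memory.oom_delay_millisecs
```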

Not merging this patch forces us into the very limiting position where we 
either completely disable the oom killer or we don't, and neither is 
helpful in the above two cases without relying on userspace to fix the 
condition (and userspace may be oom itself and locked out of freeing any 
memory via the oom killer for the very same reason).

So the question I'd ask is: why should the kernel offer only a complete 
and infinite disabling of the oom killer (something we'd never want to do 
in production) to give userspace a grace period to respond to reaching 
the hard limit, rather than also offering the option to allow the 
killing iff userspace can't expand the cgroup or kill something itself?


* Re: [patch] memcg: add oom killer delay
  2011-03-08  0:12         ` David Rientjes
@ 2011-03-08  0:29           ` Andrew Morton
  2011-03-08  0:36             ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2011-03-08  0:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 16:12:41 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> So the question I'd ask is

What about my question?  Why is your userspace oom-handler "unresponsive"?


* Re: [patch] memcg: add oom killer delay
  2011-03-08  0:29           ` Andrew Morton
@ 2011-03-08  0:36             ` David Rientjes
  2011-03-08  0:51               ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08  0:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011, Andrew Morton wrote:

> > So the question I'd ask is
> 
> What about my question?  Why is your userspace oom-handler "unresponsive"?
> 

If we have a per-memcg userspace oom handler, then it's absolutely 
required that it either increase the hard limit of the oom memcg or kill a 
task to free memory; anything else risks livelocking that memcg.  At the 
same time, putting the oom handler in a different memcg doesn't really 
help: that memcg may be oom at the same time.  If we risk livelocking a 
memcg when it is oom and the oom killer cannot respond (the only reason 
for the oom killer to exist in the first place), then there's no guarantee 
that a userspace oom handler could respond under livelock.

For users who don't choose to implement their own userspace oom handler, 
memory.oom_delay_millisecs could also be used to delay killing a task 
until the memcg has been at its hard limit for a specific period of time, 
without relying on memory thresholds to signal the task to start freeing 
some of its own memory.


* Re: [patch] memcg: add oom killer delay
  2011-03-08  0:36             ` David Rientjes
@ 2011-03-08  0:51               ` Andrew Morton
  2011-03-08  1:02                 ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2011-03-08  0:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 16:36:47 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 7 Mar 2011, Andrew Morton wrote:
> 
> > > So the question I'd ask is
> > 
> > What about my question?  Why is your userspace oom-handler "unresponsive"?
> > 
> 
> If we have a per-memcg userspace oom handler, then it's absolutely 
> required that it either increase the hard limit of the oom memcg or kill a 
> task to free memory; anything else risks livelocking that memcg.  At 
> the same time, the oom handler's memcg isn't really important: it may be 
> in a different memcg but it may be oom at the same time.  If we risk 
> livelocking the memcg when it is oom and the oom killer cannot respond 
> (the only reason for the oom killer to exist in the first place), then 
> there's no guarantee that a userspace oom handler could respond under 
> livelock.

So you're saying that your userspace oom-handler is in a memcg which is
also oom?  That this is the only situation you've observed in which the
userspace oom-handler is "unresponsive"?


* Re: [patch] memcg: add oom killer delay
  2011-03-08  0:51               ` Andrew Morton
@ 2011-03-08  1:02                 ` David Rientjes
  2011-03-08  1:18                   ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08  1:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011, Andrew Morton wrote:

> > > > So the question I'd ask is
> > > 
> > > What about my question?  Why is your userspace oom-handler "unresponsive"?
> > > 
> > 
> > If we have a per-memcg userspace oom handler, then it's absolutely 
> > required that it either increase the hard limit of the oom memcg or kill a 
> > task to free memory; anything else risks livelocking that memcg.  At 
> > the same time, the oom handler's memcg isn't really important: it may be 
> > in a different memcg but it may be oom at the same time.  If we risk 
> > livelocking the memcg when it is oom and the oom killer cannot respond 
> > (the only reason for the oom killer to exist in the first place), then 
> > there's no guarantee that a userspace oom handler could respond under 
> > livelock.
> 
> So you're saying that your userspace oom-handler is in a memcg which is
> also oom?

It could be, if users assign the handler to a different memcg; otherwise, 
it's guaranteed.  Keep in mind that for oom situations we give the killed 
task access to memory reserves below the min watermark with TIF_MEMDIE so 
that it can allocate memory to exit as quickly as possible (either to 
handle the SIGKILL or within the exit path).  That's because we can't 
guarantee anything within an oom system, cpuset, mempolicy, or memcg is 
ever responsive without it.  (And, the side effect of it and its threads 
exiting is the freeing of memory which allows everything else to once 
again be responsive.)

> That this is the only situation you've observed in which the
> userspace oom-handler is "unresponsive"?
> 

Personally, yes, but I could imagine other users getting caught if their 
userspace oom handler takes locks (such as mmap_sem, via reads within 
procfs) that a thread within an oom memcg already holds.


* Re: [patch] memcg: add oom killer delay
  2011-03-08  1:02                 ` David Rientjes
@ 2011-03-08  1:18                   ` Andrew Morton
  2011-03-08  1:33                     ` David Rientjes
  2011-03-08  3:06                     ` [patch] memcg: add oom killer delay KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2011-03-08  1:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 17:02:36 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 7 Mar 2011, Andrew Morton wrote:
> 
> > > > > So the question I'd ask is
> > > > 
> > > > What about my question?  Why is your userspace oom-handler "unresponsive"?
> > > > 
> > > 
> > > If we have a per-memcg userspace oom handler, then it's absolutely 
> > > required that it either increase the hard limit of the oom memcg or kill a 
> > > task to free memory; anything else risks livelocking that memcg.  At 
> > > the same time, the oom handler's memcg isn't really important: it may be 
> > > in a different memcg but it may be oom at the same time.  If we risk 
> > > livelocking the memcg when it is oom and the oom killer cannot respond 
> > > (the only reason for the oom killer to exist in the first place), then 
> > > there's no guarantee that a userspace oom handler could respond under 
> > > livelock.
> > 
> > So you're saying that your userspace oom-handler is in a memcg which is
> > also oom?
> 
> It could be, if users assign the handler to a different memcg; otherwise, 
> it's guaranteed.

Putting the handler into the same container would be rather daft.

If userspace is going to elect to take over a kernel function then it
should be able to perform that function reliably.  We don't have hacks
in the kernel to stop runaway SCHED_FIFO tasks, either.  If the oom
handler has put itself into a memcg and then has permitted that memcg
to go oom then userspace is busted.

>  Keep in mind that for oom situations we give the killed 
> task access to memory reserves below the min watermark with TIF_MEMDIE so 
> that they can allocate memory to exit as quickly as possible (either to 
> handle the SIGKILL or within the exit path).  That's because we can't 
> guarantee anything within an oom system, cpuset, mempolicy, or memcg is 
> ever responsive without it.  (And, the side effect of it and its threads 
> exiting is the freeing of memory which allows everything else to once 
> again be responsive.)
> 
> > That this is the only situation you've observed in which the
> > userspace oom-handler is "unresponsive"?
> > 
> 
> Personally, yes, but I could imagine other users could get caught if their 
> userspace oom handler requires taking locks (such as mmap_sem) by reading 
> within procfs that a thread within an oom memcg already holds.

If activity in one memcg causes a lockup of processes in a separate
memcg then that's a containment violation and we should fix it.

One could argue that peering into a separate memcg's procfs files was
already a containment violation, but from a practical point of view we
definitely do want processes in one memcg to be able to passively
observe activity in another without stepping on landmines.


My issue with this patch is that it extends the userspace API.  This
means we're committed to maintaining that interface *and its behaviour*
for evermore.  But the oom-killer and memcg are both areas of intense
development and the former has a habit of getting ripped out and
rewritten.  Committing ourselves to maintaining an extension to the
userspace interface is a big thing, especially as that extension is
somewhat tied to internal implementation details and is most definitely
tied to short-term inadequacies in userspace and in the kernel
implementation.

We should not commit the kernel to maintaining this new interface for
all time until all alternatives have been eliminated.  The patch looks
to me like a short-term hack to work around medium-term userspace and
kernel inadequacies, and that's a really bad basis upon which to merge
it.  Expedient hacks do sometimes make sense, but it's real bad when
they appear in the API.


* Re: [patch] memcg: add oom killer delay
  2011-03-08  1:18                   ` Andrew Morton
@ 2011-03-08  1:33                     ` David Rientjes
  2011-03-08  2:51                       ` KAMEZAWA Hiroyuki
  2011-03-08  3:06                     ` [patch] memcg: add oom killer delay KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08  1:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011, Andrew Morton wrote:

> > It could be, if users assign the handler to a different memcg; otherwise, 
> > it's guaranteed.
> 
> Putting the handler into the same container would be rather daft.
> 
> If userspace is going to elect to take over a kernel function then it
> should be able to perform that function reliably.  We don't have hacks
> in the kernel to stop runaway SCHED_FIFO tasks, either.  If the oom
> handler has put itself into a memcg and then has permitted that memcg
> to go oom then userspace is busted.
> 

We have a container specifically for daemons like this and have struggled 
for years to accurately predict how much memory it needs and what to do 
when it is oom.  The problem, in this case, is that when it's oom it's too 
late: the memcg is livelocked, no memory limits on the system have a 
chance of getting increased, and nothing in oom memcgs is guaranteed to 
ever make forward progress again.

That's why I keep bringing up the point that this patch is not a bugfix: 
it's an extension of a feature (memory.oom_control) to allow userspace a 
period of time to respond to memcgs reaching their hard limit before 
killing something.  For our container with vital system daemons, this is 
absolutely mandatory if something consumes a large amount of memory and 
needs to be restarted; we want the logic in userspace to determine what to 
do without killing vital tasks or panicking.  We want to use the oom 
killer only as a last resort and that can effectively be done with this 
patch and not with memory.oom_control (and I think this is why Kame acked 
it).

> My issue with this patch is that it extends the userspace API.  This
> means we're committed to maintaining that interface *and its behaviour*
> for evermore.  But the oom-killer and memcg are both areas of intense
> development and the former has a habit of getting ripped out and
> rewritten.  Committing ourselves to maintaining an extension to the
> userspace interface is a big thing, especially as that extension is
> somewhat tied to internal implementation details and is most definitely
> tied to short-term inadequacies in userspace and in the kernel
> implementation.
> 

The same could have been said for memory.oom_control to disable the oom 
killer entirely, which now seems to be solidified as the only way to 
influence oom killer behavior from the kernel, and now we're locked into 
that limitation because we don't want dual interfaces.  I think this patch 
would have been received much better prior to memory.oom_control since it 
allows for the same behavior with an infinite timeout.  memory.oom_control 
is not an option for us since we can't guarantee that any userspace daemon 
at our scale will ever be responsive 100% of the time.

I don't think the idea of a userspace grace period when a memcg is oom is 
that abstract, though.  I think applications should have the opportunity 
to free some of their own memory first when oom instead of abruptly 
killing something and restarting it.

So, in the end, we may have to carry this patch internally forever but I 
think as memcg becomes more popular we'll have a higher demand for such a 
grace period.


* Re: [patch] memcg: add oom killer delay
  2011-03-08  1:33                     ` David Rientjes
@ 2011-03-08  2:51                       ` KAMEZAWA Hiroyuki
  2011-03-08  3:07                         ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-08  2:51 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 17:33:44 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 7 Mar 2011, Andrew Morton wrote:
> 
> > > It could be, if users assign the handler to a different memcg; otherwise, 
> > > it's guaranteed.
> > 
> > Putting the handler into the same container would be rather daft.
> > 
> > If userspace is going to elect to take over a kernel function then it
> > should be able to perform that function reliably.  We don't have hacks
> > in the kernel to stop runaway SCHED_FIFO tasks, either.  If the oom
> > handler has put itself into a memcg and then has permitted that memcg
> > to go oom then userspace is busted.
> > 
> 
> We have a container specifically for daemons like this and have struggled 
> for years to accurately predict how much memory it needs and what to do 
> when it is oom.  The problem, in this case, is that when it's oom it's too 
> late: the memcg is livelocked, no memory limits on the system have a 
> chance of getting increased, and nothing in oom memcgs is guaranteed to 
> ever make forward progress again.
> 
> That's why I keep bringing up the point that this patch is not a bugfix: 
> it's an extension of a feature (memory.oom_control) to allow userspace a 
> period of time to respond to memcgs reaching their hard limit before 
> killing something.  For our container with vital system daemons, this is 
> absolutely mandatory if something consumes a large amount of memory and 
> needs to be restarted; we want the logic in userspace to determine what to 
> do without killing vital tasks or panicking.  We want to use the oom 
> killer only as a last resort and that can effectively be done with this 
> patch and not with memory.oom_control (and I think this is why Kame acked 
> it).
> 

I acked it just because the code itself seems to work, _and_ because I
can't convince you that the feature is never necessary.  But please note,
you haven't convinced me that it *is* necessary.

BTW, why is it that "the memcg is livelocked and then no memory limits on 
the system have a chance of getting increased"?

Is there a memcg bug which prevents increasing the limit?

Thanks,
-Kame


* Re: [patch] memcg: add oom killer delay
  2011-03-08  1:18                   ` Andrew Morton
  2011-03-08  1:33                     ` David Rientjes
@ 2011-03-08  3:06                     ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-08  3:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 17:18:53 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 7 Mar 2011 17:02:36 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> >  Keep in mind that for oom situations we give the killed 
> > task access to memory reserves below the min watermark with TIF_MEMDIE so 
> > that they can allocate memory to exit as quickly as possible (either to 
> > handle the SIGKILL or within the exit path).  That's because we can't 
> > guarantee anything within an oom system, cpuset, mempolicy, or memcg is 
> > ever responsive without it.  (And, the side effect of it and its threads 
> > exiting is the freeing of memory which allows everything else to once 
> > again be responsive.)
> > 
> > > That this is the only situation you've observed in which the
> > > userspace oom-handler is "unresponsive"?
> > > 
> > 
> > Personally, yes, but I could imagine other users could get caught if their 
> > userspace oom handler requires taking locks (such as mmap_sem) by reading 
> > within procfs that a thread within an oom memcg already holds.
> 
> If activity in one memcg causes a lockup of processes in a separate
> memcg then that's a containment violation and we should fix it.
> 

I hope dirty_ratio + an async I/O controller can be a help; the cpu
controller is the only help for now (for limiting time spent in vmscan).

I'm not sure what we need beyond those for now.



> One could argue that peering into a separate memcg's procfs files was
> already a containment violation, but from a practical point of view we
> definitely do want processes in a separate memcg to be able to
> passively observe activity in another without stepping on landmines.
> 

It's the namespace's job, I think.

Thanks,
-Kame


* Re: [patch] memcg: add oom killer delay
  2011-03-08  2:51                       ` KAMEZAWA Hiroyuki
@ 2011-03-08  3:07                         ` David Rientjes
  2011-03-08  3:13                           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08  3:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:

> BTW, why is it that "the memcg is livelocked and then no memory limits on 
> the system have a chance of getting increased"?
> 

I was referring specifically to the memcg to which a job scheduler or 
userspace daemon responsible for doing so is attached.  If the thread 
responsible for managing memcgs and increasing limits or killing off lower 
priority jobs is in a memcg that is oom, there is a chance it will never 
be able to respond to the condition.


* Re: [patch] memcg: add oom killer delay
  2011-03-08  3:07                         ` David Rientjes
@ 2011-03-08  3:13                           ` KAMEZAWA Hiroyuki
  2011-03-08  3:56                             ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-08  3:13 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 19:07:10 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:
> 
> > BTW, why "the memcg is livelocked and then no memory limits on the system have 
> > a chance of getting increased"
> > 
> 
> I was referring specifically to the memcg which a job scheduler or 
> userspace daemon responsible for doing so is attached.  If the thread 
> responsible for managing memcgs and increasing limits or killing off lower 
> priority jobs is in a memcg that is oom, there is a chance it will never 
> be able to respond to the condition.
> 

I just think the memcg for such daemons shouldn't have any limit, or must
not set oom_disable.  I think you know that.  So the question is: why
can't you do that?  Is there a special reason which comes from cgroup's
characteristics?


Thanks,
-kame


* Re: [patch] memcg: add oom killer delay
  2011-03-08  3:13                           ` KAMEZAWA Hiroyuki
@ 2011-03-08  3:56                             ` David Rientjes
  2011-03-08  4:17                               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08  3:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:

> > I was referring specifically to the memcg which a job scheduler or 
> > userspace daemon responsible for doing so is attached.  If the thread 
> > responsible for managing memcgs and increasing limits or killing off lower 
> > priority jobs is in a memcg that is oom, there is a chance it will never 
> > be able to respond to the condition.
> > 
> 
> I just think the memcg for such daemons shouldn't have any limit, or must
> not set oom_disable.  I think you know that.  So the question is: why
> can't you do that?  Is there a special reason which comes from cgroup's
> characteristics?
> 

Being in the root memcg doesn't mean the aggregate of your memcgs' hard 
limits can't exceed the system's memory capacity.


* Re: [patch] memcg: add oom killer delay
  2011-03-08  3:56                             ` David Rientjes
@ 2011-03-08  4:17                               ` KAMEZAWA Hiroyuki
  2011-03-08  5:30                                 ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-08  4:17 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 19:56:23 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:
> 
> > > I was referring specifically to the memcg which a job scheduler or 
> > > userspace daemon responsible for doing so is attached.  If the thread 
> > > responsible for managing memcgs and increasing limits or killing off lower 
> > > priority jobs is in a memcg that is oom, there is a chance it will never 
> > > be able to respond to the condition.
> > > 
> > 
> > I just think the memcg for such daemons shouldn't have any limit, or must
> > not set oom_disable.  I think you know that.  So the question is: why
> > can't you do that?  Is there a special reason which comes from cgroup's
> > characteristics?
> > 
> 
> Being in the root memcg doesn't mean the aggregate of your memcgs' hard 
> limits can't exceed the system's memory capacity.
> 
Hmm? That's an unexpected answer.  Why is the system's capacity a problem
here?  (The root memcg never has a limit.)

Is the problem that there is no 'guarantee' or 'private page pool' for
the daemons?

BTW, if the system oom killer runs, a memcg's oom_disable is ignored.
Why doesn't the system oom killer kick in?

Thanks,
-Kame


* Re: [patch] memcg: add oom killer delay
  2011-03-08  4:17                               ` KAMEZAWA Hiroyuki
@ 2011-03-08  5:30                                 ` David Rientjes
  2011-03-08  5:49                                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08  5:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:

> Hmm? That's an unexpected answer. Why is the system's capacity a problem here?
> (The root memcg never has a 'limit'.)
> 
> Is it a problem that there is no 'guarantee' or 'private page pool'
> for daemons?
> 

It's not an inherent problem of memcg, it's a configuration issue: if your 
userspace application cannot respond to address an oom condition in a 
memcg for whatever reason (such as it being in an oom memcg itself), then 
there's a chance that the memcg will livelock since the kernel cannot do 
anything to fix the issue itself.

That's aside from the general purpose of the new 
memory.oom_delay_millisecs: users may want a grace period for userspace to 
increase the hard limit or kill a task before deferring to the kernel.  
That seems exponentially more useful than simply disabling the oom killer 
entirely with memory.oom_control.  I think it's unfortunate 
memory.oom_control was merged first and seems to have tainted this entire 
discussion.
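
As a concrete sketch of the two interfaces being contrasted (the cgroup mount point and group name are assumptions; memory.oom_delay_millisecs exists only with this patch applied):

```shell
# Proposed: grace period -- userspace has up to 60s to respond before the
# kernel oom killer runs (memory.oom_delay_millisecs, from this patch).
echo 60000 > /cgroup/memory/jobs/memory.oom_delay_millisecs

# Merged: disable the memcg oom killer outright -- tasks block in
# TASK_KILLABLE until userspace raises the limit or re-enables the killer.
echo 1 > /cgroup/memory/jobs/memory.oom_control
```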


* Re: [patch] memcg: add oom killer delay
  2011-03-08  5:30                                 ` David Rientjes
@ 2011-03-08  5:49                                   ` KAMEZAWA Hiroyuki
  2011-03-08 23:49                                     ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-08  5:49 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Mon, 7 Mar 2011 21:30:19 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:
> 
> > Hmm? That's an unexpected answer. Why is the system's capacity a problem here?
> > (The root memcg never has a 'limit'.)
> > 
> > Is it a problem that there is no 'guarantee' or 'private page pool'
> > for daemons?
> > 
> 
> It's not an inherent problem of memcg, it's a configuration issue: if your 
> userspace application cannot respond to address an oom condition in a 
> memcg for whatever reason (such as it being in an oom memcg itself), then 
> there's a chance that the memcg will livelock since the kernel cannot do 
> anything to fix the issue itself.
> 
> That's aside from the general purpose of the new 
> memory.oom_delay_millisecs: users may want a grace period for userspace to 
> increase the hard limit or kill a task before deferring to the kernel.  
> That seems exponentially more useful than simply disabling the oom killer 
> entirely with memory.oom_control.  I think it's unfortunate 
> memory.oom_control was merged first and seems to have tainted this entire 
> discussion.
> 

That sounds like a mis-usage problem... what kind of workaround is offered
if the user doesn't configure oom_delay_millisecs, yet another mis-usage?

Thanks,
-Kame


* Re: [patch] memcg: add oom killer delay
  2011-03-08  5:49                                   ` KAMEZAWA Hiroyuki
@ 2011-03-08 23:49                                     ` David Rientjes
  2011-03-09  6:04                                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-08 23:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:

> > That's aside from the general purpose of the new 
> > memory.oom_delay_millisecs: users may want a grace period for userspace to 
> > increase the hard limit or kill a task before deferring to the kernel.  
> > That seems exponentially more useful than simply disabling the oom killer 
> > entirely with memory.oom_control.  I think it's unfortunate 
> > memory.oom_control was merged first and seems to have tainted this entire 
> > discussion.
> > 
> 
> That sounds like a mis-usage problem... what kind of workaround is offered
> if the user doesn't configure oom_delay_millisecs, yet another mis-usage?
> 

Not exactly sure what you mean, but you're saying disabling the oom killer 
with memory.oom_control is not the recommended way to allow userspace to 
fix the issue itself?  That seems like it's the entire usecase: we'd 
rarely want to let a memcg stall when it needs memory without trying to 
address the problem (elevating the limit, killing a lower priority job, 
sending a signal to free memory).  We have a memcg oom notifier to handle 
the situation but there's no guarantee that the kernel won't kill 
something first and that's a bad result if we chose to address it with one 
of the ways mentioned above.

To answer your question: if the admin doesn't configure a 
memory.oom_delay_millisecs, then the oom killer will obviously kill 
something off (if memory.oom_control is also not set) when reclaim fails 
to free memory just as before.

Aside from my specific usecase for this tunable, let me pose a question: 
do you believe that the memory controller would benefit from allowing 
users to have a grace period in which to take one of the actions listed 
above instead of killing something itself?  Yes, this would be possible by 
setting and then unsetting memory.oom_control, but that requires userspace 
to always be responsive (which, at our scale, we can unequivocally say 
isn't always possible) and doesn't effectively deal with spikes in memory 
that may only be temporary and don't require any intervention from the 
user at all.
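
The set-then-unset flow described above can be sketched with the existing eventfd-based memcg oom notifier (memory.oom_control plus cgroup.event_control); the mount point, group name, and new limit are assumptions, and cgroup_event_listener is the sample listener carried in the kernel tree:

```shell
# 1. Let userspace act first: disable the kernel's memcg oom killer.
echo 1 > /cgroup/memory/jobs/memory.oom_control

# 2. Block until the memcg goes oom (eventfd-based notification).
cgroup_event_listener /cgroup/memory/jobs/memory.oom_control oom

# 3. Respond: raise the limit, kill a lower-priority job, or give up and
#    hand the condition back to the kernel.
echo 2G > /cgroup/memory/jobs/memory.limit_in_bytes
echo 0 > /cgroup/memory/jobs/memory.oom_control
```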


* Re: [patch] memcg: add oom killer delay
  2011-03-08 23:49                                     ` David Rientjes
@ 2011-03-09  6:04                                       ` KAMEZAWA Hiroyuki
  2011-03-09  6:44                                         ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09  6:04 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Mar 2011 15:49:10 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 8 Mar 2011, KAMEZAWA Hiroyuki wrote:
> 
> > > That's aside from the general purpose of the new 
> > > memory.oom_delay_millisecs: users may want a grace period for userspace to 
> > > increase the hard limit or kill a task before deferring to the kernel.  
> > > That seems exponentially more useful than simply disabling the oom killer 
> > > entirely with memory.oom_control.  I think it's unfortunate 
> > > memory.oom_control was merged first and seems to have tainted this entire 
> > > discussion.
> > > 
> > 
> > That sounds like a mis-usage problem... what kind of workaround is offered
> > if the user doesn't configure oom_delay_millisecs, yet another mis-usage?
> > 
> 
> Not exactly sure what you mean, but you're saying disabling the oom killer 
> with memory.oom_control is not the recommended way to allow userspace to 
> fix the issue itself?  That seems like it's the entire usecase: we'd 
> rarely want to let a memcg stall when it needs memory without trying to 
> address the problem (elevating the limit, killing a lower priority job, 
> sending a signal to free memory).  We have a memcg oom notifier to handle 
> the situation but there's no guarantee that the kernel won't kill 
> something first and that's a bad result if we chose to address it with one 
> of the ways mentioned above.
> 

Why would memcg's oom and the system's oom happen at the same time?



> To answer your question: if the admin doesn't configure a 
> memory.oom_delay_millisecs, then the oom killer will obviously kill 
> something off (if memory.oom_control is also not set) when reclaim fails 
> to free memory just as before.
> 
> Aside from my specific usecase for this tunable, let me pose a question: 
> do you believe that the memory controller would benefit from allowing 
> users to have a grace period in which to take one of the actions listed 
> above instead of killing something itself?  Yes, this would be possible by 
> setting and then unsetting memory.oom_control, but that requires userspace 
> to always be responsive (which, at our scale, we can unequivocally say 
> isn't always possible) and doesn't effectively deal with spikes in memory 
> that may only be temporary and don't require any intervention from the 
> user at all.
> 

Please add a 'notifier' in kernel space and handle the event in a kernel module.
It is much better than 'timeout and allow oom-kill again'.

If you add a notifier chain in memcg's oom path, I have no objection.
Implementing a custom oom handler for it in a kernel module sounds better
than a timeout. If necessary, please export some functionality of memcg.

IIUC, the system's oom-killer has a notifier chain for oom-kill. There is no
reason it's bad to have one for memcg.

Isn't it ok? I think you can do what you want with it.

Thanks,
-Kame
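
A sketch of what the suggested notifier would look like from a module's side; the global oom killer's register_oom_notifier() in <linux/oom.h> is the existing model, while register_memcg_oom_notifier() is hypothetical, named here only for illustration:

```c
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/oom.h>

/* Called from the memcg oom path; could killall/SIGSTOP-all the group's
 * tasks or free memory on its behalf, then let the charge be retried. */
static int my_memcg_oom(struct notifier_block *nb, unsigned long action,
			void *data)
{
	return NOTIFY_OK;
}

static struct notifier_block my_memcg_oom_nb = {
	.notifier_call = my_memcg_oom,
};

static int __init my_oom_module_init(void)
{
	/* hypothetical memcg analogue of register_oom_notifier() */
	return register_memcg_oom_notifier(&my_memcg_oom_nb);
}
module_init(my_oom_module_init);
MODULE_LICENSE("GPL");
```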





* Re: [patch] memcg: add oom killer delay
  2011-03-09  6:04                                       ` KAMEZAWA Hiroyuki
@ 2011-03-09  6:44                                         ` David Rientjes
  2011-03-09  7:16                                           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-09  6:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 9 Mar 2011, KAMEZAWA Hiroyuki wrote:

> > Not exactly sure what you mean, but you're saying disabling the oom killer 
> > with memory.oom_control is not the recommended way to allow userspace to 
> > fix the issue itself?  That seems like it's the entire usecase: we'd 
> > rarely want to let a memcg stall when it needs memory without trying to 
> > address the problem (elevating the limit, killing a lower priority job, 
> > sending a signal to free memory).  We have a memcg oom notifier to handle 
> > the situation but there's no guarantee that the kernel won't kill 
> > something first and that's a bad result if we chose to address it with one 
> > of the ways mentioned above.
> > 
> 
> Why would memcg's oom and the system's oom happen at the same time?
> 

Again, I'm not sure what you mean: there's no system oom in what I 
describe above.  I'm saying that userspace may not have sufficient time to 
react to an oom notification unless the oom killer is disabled via 
memory.oom_control and re-enabled iff userspace chooses to defer to the 
kernel.

> > Aside from my specific usecase for this tunable, let me pose a question: 
> > do you believe that the memory controller would benefit from allowing 
> > users to have a grace period in which to take one of the actions listed 
> > above instead of killing something itself?  Yes, this would be possible by 
> > setting and then unsetting memory.oom_control, but that requires userspace 
> > to always be responsive (which, at our scale, we can unequivocally say 
> > isn't always possible) and doesn't effectively deal with spikes in memory 
> > that may only be temporary and don't require any intervention from the 
> > user at all.
> > 
> 
> > Please add a 'notifier' in kernel space and handle the event in a kernel module.
> > It is much better than 'timeout and allow oom-kill again'.
> 

A kernel-space notifier would certainly be helpful, but at what point does 
the kernel choose to oom kill something?  If there's an oom notifier in 
place, do we always defer killing or for a set period of time?  If it's 
the latter then we'll still want the timeout, otherwise there's no way to 
guarantee we haven't killed something by the time userspace has a chance 
to react to the notification.


* Re: [patch] memcg: add oom killer delay
  2011-03-09  6:44                                         ` David Rientjes
@ 2011-03-09  7:16                                           ` KAMEZAWA Hiroyuki
  2011-03-09 21:12                                             ` David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09  7:16 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Tue, 8 Mar 2011 22:44:11 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 9 Mar 2011, KAMEZAWA Hiroyuki wrote:
> > > Aside from my specific usecase for this tunable, let me pose a question: 
> > > do you believe that the memory controller would benefit from allowing 
> > > users to have a grace period in which to take one of the actions listed 
> > > above instead of killing something itself?  Yes, this would be possible by 
> > > setting and then unsetting memory.oom_control, but that requires userspace 
> > > to always be responsive (which, at our scale, we can unequivocally say 
> > > isn't always possible) and doesn't effectively deal with spikes in memory 
> > > that may only be temporary and don't require any intervention from the 
> > > user at all.
> > > 
> > 
> > Please add a 'notifier' in kernel space and handle the event in a kernel module.
> > It is much better than 'timeout and allow oom-kill again'.
> > 
> 
> A kernel-space notifier would certainly be helpful, but at what point does 
> the kernel choose to oom kill something?  If there's an oom notifier in 
> place, do we always defer killing or for a set period of time? 

For google, as you like.

For me, I want an oom-killall module or an oom-SIGSTOP-all module.
oom-killall would be useful for killing fork-bombs and for very quick recovery.

For me, the 1st motivation for oom-disable is taking a core dump of a
memory-leaking process and looking into it to check for the memory leak.
(panic_on_oom -> kdump is used for supporting my customer.)

Maybe my example of a notifier user doesn't sound good to you;
please find a good one.


> If it's the latter then we'll still want the timeout, otherwise there's no
> way to guarantee we haven't killed something by the time userspace has a chance
> to react to the notification.
> 

You can get a list of tasks in the cgroup and send SIGNALs with whatever filters you like.
A list of thread IDs can be obtained easily with the cgroup_iter_xxx functions.
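
From userspace, the same walk is just the memcg's tasks file; a sketch (the cgroup path is an assumption):

```shell
# "killall" / "SIGSTOP-all" from userspace: signal every task listed in
# the memcg's tasks file; add filters (comm, oom_score_adj, ...) as needed.
while read -r pid; do
    kill -KILL "$pid"    # or: kill -STOP "$pid"
done < /cgroup/memory/jobs/tasks
```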

Anyway, if you add a notifier, please give us a user of it. If possible,
it should be functionality which can never be implemented in userland even
with sane programmers, admins, and users.

For example, if every process's oom_score_adj were set to -1000 and the
oom-killer didn't work, would you implement a timeout? I think you'd say it's
a wrong configuration. memcg's oom_disable timeout is the same thing.

Thanks,
-Kame






* Re: [patch] memcg: add oom killer delay
  2011-03-09  7:16                                           ` KAMEZAWA Hiroyuki
@ 2011-03-09 21:12                                             ` David Rientjes
  2011-03-09 21:27                                               ` [patch] memcg: give current access to memory reserves if it's trying to die David Rientjes
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2011-03-09 21:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 9 Mar 2011, KAMEZAWA Hiroyuki wrote:

> For me, I want an oom-killall module or an oom-SIGSTOP-all module.
> oom-killall would be useful for killing fork-bombs and for very quick recovery.
> 

That doesn't need to be in kernelspace, though, the global oom killer will 
give access to memory reserves to any task that invokes it that has a 
pending SIGKILL, so we could extend that to the memcg oom killer as well 
and then use an oom notifier to trigger a killall in userspace.

> For me, the 1st motivation for oom-disable is taking a core dump of a
> memory-leaking process and looking into it to check for the memory leak.
> (panic_on_oom -> kdump is used for supporting my customer.)
> 

I remember your advocacy of panic_on_oom during the oom killer rewrite 
discussion, this makes it more clear.

> Anyway, if you add a notifier, please give us a user of it. If possible,
> it should be functionality which can never be implemented in userland even
> with sane programmers, admins, and users.
> 
> For example, if every process's oom_score_adj were set to -1000 and the
> oom-killer didn't work, would you implement a timeout? I think you'd say it's
> a wrong configuration. memcg's oom_disable timeout is the same thing.
> 

You know me quite well, actually :)  We've agreed to drop this patch and 
carry it internally until we can deprecate it and implement a separate oom 
handling thread in the root memcg that will detect the situation for a 
given period of time (either via an oom notifier or by checking that the 
usage of the memcg of interest equals to the limit) and then do the 
killall.
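
The root-memcg handler thread described here (watch a memcg whose kernel oom killer is disabled, and act once usage has pinned at the limit for a grace period) can be sketched as below; memory.usage_in_bytes and memory.limit_in_bytes are the real memcg v1 files, while the cgroup path, grace period, and the eventual response are assumptions:

```python
import time

def read_int(path):
    with open(path) as f:
        return int(f.read())

def watch(cgroup, grace=60.0, poll=1.0):
    """Block until `cgroup` has sat at its memory limit for `grace`
    seconds straight, then return True so the caller can respond
    (raise the limit, killall, or re-enable the kernel oom killer)."""
    since = None
    while True:
        usage = read_int(cgroup + "/memory.usage_in_bytes")
        limit = read_int(cgroup + "/memory.limit_in_bytes")
        if usage >= limit:
            if since is None:
                since = time.monotonic()
            if time.monotonic() - since >= grace:
                return True
        else:
            since = None          # usage dipped: the spike was temporary
        time.sleep(poll)
```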


* [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-09 21:12                                             ` David Rientjes
@ 2011-03-09 21:27                                               ` David Rientjes
  2011-03-09 23:30                                                 ` KAMEZAWA Hiroyuki
  2011-03-17 23:53                                                 ` Andrew Morton
  0 siblings, 2 replies; 58+ messages in thread
From: David Rientjes @ 2011-03-09 21:27 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, linux-mm

When a memcg is oom and current has already received a SIGKILL, then give
it access to memory reserves with a higher scheduling priority so that it
may quickly exit and free its memory.

This is identical to the global oom killer and is done even before
checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
selected is guaranteed to have come from userspace; the thread only needs
access to memory reserves to exit and thus we don't unnecessarily panic
the machine until the kernel has no last resort to free memory.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -537,6 +537,17 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	unsigned int points = 0;
 	struct task_struct *p;
 
+	/*
+	 * If current has a pending SIGKILL, then automatically select it.  The
+	 * goal is to allow it to allocate so that it may quickly exit and free
+	 * its memory.
+	 */
+	if (fatal_signal_pending(current)) {
+		set_thread_flag(TIF_MEMDIE);
+		boost_dying_task_prio(current, NULL);
+		return;
+	}
+
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
 	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
 	read_lock(&tasklist_lock);


* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-09 21:27                                               ` [patch] memcg: give current access to memory reserves if it's trying to die David Rientjes
@ 2011-03-09 23:30                                                 ` KAMEZAWA Hiroyuki
  2011-03-17 23:37                                                   ` David Rientjes
  2011-03-17 23:53                                                 ` Andrew Morton
  1 sibling, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-09 23:30 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 9 Mar 2011 13:27:50 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> When a memcg is oom and current has already received a SIGKILL, then give
> it access to memory reserves with a higher scheduling priority so that it
> may quickly exit and free its memory.
> 
> This is identical to the global oom killer and is done even before
> checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
> selected is guaranteed to have come from userspace; the thread only needs
> access to memory reserves to exit and thus we don't unnecessarily panic
> the machine until the kernel has no last resort to free memory.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thank you.


* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-09 23:30                                                 ` KAMEZAWA Hiroyuki
@ 2011-03-17 23:37                                                   ` David Rientjes
  0 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-03-17 23:37 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Daisuke Nishimura, linux-mm

On Thu, 10 Mar 2011, KAMEZAWA Hiroyuki wrote:

> On Wed, 9 Mar 2011 13:27:50 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> > When a memcg is oom and current has already received a SIGKILL, then give
> > it access to memory reserves with a higher scheduling priority so that it
> > may quickly exit and free its memory.
> > 
> > This is identical to the global oom killer and is done even before
> > checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
> > selected is guaranteed to have come from userspace; the thread only needs
> > access to memory reserves to exit and thus we don't unnecessarily panic
> > the machine until the kernel has no last resort to free memory.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Thank you.
> 

I'm hoping this can make 2.6.39 so that userspace can kill a thread in a 
memcg when it is oom and the oom killer is disabled via memory.oom_control 
and it will still get access to memory reserves if needed while trying to 
exit.


* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-09 21:27                                               ` [patch] memcg: give current access to memory reserves if it's trying to die David Rientjes
  2011-03-09 23:30                                                 ` KAMEZAWA Hiroyuki
@ 2011-03-17 23:53                                                 ` Andrew Morton
  2011-03-18  4:35                                                   ` KAMEZAWA Hiroyuki
  2011-03-18 20:32                                                   ` David Rientjes
  1 sibling, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2011-03-17 23:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Wed, 9 Mar 2011 13:27:50 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> When a memcg is oom and current has already received a SIGKILL, then give
> it access to memory reserves with a higher scheduling priority so that it
> may quickly exit and free its memory.
> 
> This is identical to the global oom killer and is done even before
> checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
> selected is guaranteed to have come from userspace; the thread only needs
> access to memory reserves to exit and thus we don't unnecessarily panic
> the machine until the kernel has no last resort to free memory.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -537,6 +537,17 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  	unsigned int points = 0;
>  	struct task_struct *p;
>  
> +	/*
> +	 * If current has a pending SIGKILL, then automatically select it.  The
> +	 * goal is to allow it to allocate so that it may quickly exit and free
> +	 * its memory.
> +	 */
> +	if (fatal_signal_pending(current)) {
> +		set_thread_flag(TIF_MEMDIE);
> +		boost_dying_task_prio(current, NULL);
> +		return;
> +	}
> +
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
>  	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
>  	read_lock(&tasklist_lock);

The code duplication seems a bit gratuitous.



Was it deliberate that mem_cgroup_out_of_memory() ignores the oom
notifier callbacks?

(Why does that notifier list exist at all?  Wouldn't it be better to do
this via a vmscan shrinker?  Perhaps altered to be passed the scanning
priority?)


* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-17 23:53                                                 ` Andrew Morton
@ 2011-03-18  4:35                                                   ` KAMEZAWA Hiroyuki
  2011-03-18  5:17                                                     ` Andrew Morton
  2011-03-18 20:32                                                   ` David Rientjes
  1 sibling, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-18  4:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, linux-mm

On Thu, 17 Mar 2011 16:53:19 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 9 Mar 2011 13:27:50 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> > When a memcg is oom and current has already received a SIGKILL, then give
> > it access to memory reserves with a higher scheduling priority so that it
> > may quickly exit and free its memory.
> > 
> > This is identical to the global oom killer and is done even before
> > checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
> > selected is guaranteed to have come from userspace; the thread only needs
> > access to memory reserves to exit and thus we don't unnecessarily panic
> > the machine until the kernel has no last resort to free memory.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/oom_kill.c |   11 +++++++++++
> >  1 files changed, 11 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -537,6 +537,17 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >  	unsigned int points = 0;
> >  	struct task_struct *p;
> >  
> > +	/*
> > +	 * If current has a pending SIGKILL, then automatically select it.  The
> > +	 * goal is to allow it to allocate so that it may quickly exit and free
> > +	 * its memory.
> > +	 */
> > +	if (fatal_signal_pending(current)) {
> > +		set_thread_flag(TIF_MEMDIE);
> > +		boost_dying_task_prio(current, NULL);
> > +		return;
> > +	}
> > +
> >  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
> >  	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
> >  	read_lock(&tasklist_lock);
> 
> The code duplication seems a bit gratuitous.
> 
> 
> 
> Was it deliberate that mem_cgroup_out_of_memory() ignores the oom
> notifier callbacks?
> 

I'm not sure for what purpose the notifier chain for oom exists.
At a look, it's for the s390/powerpc Collaborative Memory Manager..?

About memcg, a notifier to userland already exists and I thought I didn't
need to call the CMM callbacks (for now, there is no such user with memcg, I guess.)

Thanks,
-kame

> (Why does that notifier list exist at all?  Wouldn't it be better to do
> this via a vmscan shrinker?  Perhaps altered to be passed the scanning
> priority?)
> 


* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-18  4:35                                                   ` KAMEZAWA Hiroyuki
@ 2011-03-18  5:17                                                     ` Andrew Morton
  2011-03-18  5:58                                                       ` KAMEZAWA Hiroyuki
  2011-03-18 20:36                                                       ` David Rientjes
  0 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2011-03-18  5:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, linux-mm

On Fri, 18 Mar 2011 13:35:34 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 17 Mar 2011 16:53:19 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > 
> > Was it deliberate that mem_cgroup_out_of_memory() ignores the oom
> > notifier callbacks?
> > 
> 
> I'm not sure for what purpose the notifier chain for oom exists.
> At a look, it's for the s390/powerpc Collaborative Memory Manager..?

commit 8bc719d3cab8414938f9ea6e33b58d8810d18068
Author:     Martin Schwidefsky <schwidefsky@de.ibm.com>
AuthorDate: Mon Sep 25 23:31:20 2006 -0700
Commit:     Linus Torvalds <torvalds@g5.osdl.org>
CommitDate: Tue Sep 26 08:48:47 2006 -0700

    [PATCH] out of memory notifier
    
    Add a notifer chain to the out of memory killer.  If one of the registered
    callbacks could release some memory, do not kill the process but return and
    retry the allocation that forced the oom killer to run.
    
    The purpose of the notifier is to add a safety net in the presence of
    memory ballooners.  If the resource manager inflated the balloon to a size
    where memory allocations can not be satisfied anymore, it is better to
    deflate the balloon a bit instead of killing processes.
    
    The implementation for the s390 ballooner is included.

> About memcg, a notifier to userland already exists, and I thought I didn't
> need to call the CMM callbacks (for now, there is no such user with memcg, I guess.)

Seems to me that the callback should be performed.

Or, perhaps, migrate it over to use the shrinker stuff, along with
suitable handling of the scanning priority.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-18  5:17                                                     ` Andrew Morton
@ 2011-03-18  5:58                                                       ` KAMEZAWA Hiroyuki
  2011-03-18 20:36                                                       ` David Rientjes
  1 sibling, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-18  5:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, linux-mm

On Thu, 17 Mar 2011 22:17:58 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 18 Mar 2011 13:35:34 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 17 Mar 2011 16:53:19 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > 
> > > Was it deliberate that mem_cgroup_out_of_memory() ignores the oom
> > > notifier callbacks?
> > > 
> > 
> > I'm not sure for what purpose the notifier chain for oom exists.
> > At a look, it's for the s390/powerpc Collaborative Memory Manager..?
> 
> commit 8bc719d3cab8414938f9ea6e33b58d8810d18068
> Author:     Martin Schwidefsky <schwidefsky@de.ibm.com>
> AuthorDate: Mon Sep 25 23:31:20 2006 -0700
> Commit:     Linus Torvalds <torvalds@g5.osdl.org>
> CommitDate: Tue Sep 26 08:48:47 2006 -0700
> 
>     [PATCH] out of memory notifier
>     
>     Add a notifer chain to the out of memory killer.  If one of the registered
>     callbacks could release some memory, do not kill the process but return and
>     retry the allocation that forced the oom killer to run.
>     
>     The purpose of the notifier is to add a safety net in the presence of
>     memory ballooners.  If the resource manager inflated the balloon to a size
>     where memory allocations can not be satisfied anymore, it is better to
>     deflate the balloon a bit instead of killing processes.
>     
>     The implementation for the s390 ballooner is included.
> 
> > About memcg, a notifier to userland already exists, and I thought I didn't
> > need to call the CMM callbacks (for now, there is no such user with memcg, I guess.)
> 
> Seems to me that the callback should be performed.
> 

Hmm, (for now) the memory cgroup just handles user memory, so a kernel
callback cannot do anything other than dropping file cache if some
module pins it.

> Or, perhaps, migrate it over to use the shrinker stuff, along with
> suitable handling of the scanning priority.
> 

Ah, yes. A callback at oom time is too late, I think.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-17 23:53                                                 ` Andrew Morton
  2011-03-18  4:35                                                   ` KAMEZAWA Hiroyuki
@ 2011-03-18 20:32                                                   ` David Rientjes
  1 sibling, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-03-18 20:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Thu, 17 Mar 2011, Andrew Morton wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -537,6 +537,17 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >  	unsigned int points = 0;
> >  	struct task_struct *p;
> >  
> > +	/*
> > +	 * If current has a pending SIGKILL, then automatically select it.  The
> > +	 * goal is to allow it to allocate so that it may quickly exit and free
> > +	 * its memory.
> > +	 */
> > +	if (fatal_signal_pending(current)) {
> > +		set_thread_flag(TIF_MEMDIE);
> > +		boost_dying_task_prio(current, NULL);
> > +		return;
> > +	}
> > +
> >  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL);
> >  	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
> >  	read_lock(&tasklist_lock);
> 
> The code duplication seems a bit gratuitous.
> 

I thought it was small enough to appear in two functions as opposed to 
doing something like

	static bool oom_current_has_sigkill(void)
	{
		if (fatal_signal_pending(current)) {
			set_thread_flag(TIF_MEMDIE);
			boost_dying_task_prio(current, NULL);
			return true;
		}
		return false;
	}

then doing

	if (oom_current_has_sigkill())
		return;

in mem_cgroup_out_of_memory() and out_of_memory().  If you'd prefer 
oom_current_has_sigkill(), let me know and I'll propose an alternate 
version.

> Was it deliberate that mem_cgroup_out_of_memory() ignores the oom
> notifier callbacks?
> 

Yes, the memory controller requires that something be killed (or, in the 
above case, simply allowing something to exit) to return under the hard 
limit; that's why we automatically kill current if nothing else eligible 
is found in select_bad_process().  Using oom notifier callbacks wouldn't 
guarantee that any freed memory would benefit this memcg anyway.

> (Why does that notifier list exist at all?  Wouldn't it be better to do
> this via a vmscan shrinker?  Perhaps altered to be passed the scanning
> priority?)
> 

A vmscan shrinker seems more appropriate in the page allocator and not the 
oom killer before we call out_of_memory() and, as already mentioned, 
oom_notify_list doesn't do much at all (and is the wrong thing to do for 
cpusets or mempolicy ooms).  I've been reluctant to remove it because it 
doesn't have any impact on our systems but was obviously introduced for 
somebody's advantage in a rather unintrusive way.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: give current access to memory reserves if it's trying to die
  2011-03-18  5:17                                                     ` Andrew Morton
  2011-03-18  5:58                                                       ` KAMEZAWA Hiroyuki
@ 2011-03-18 20:36                                                       ` David Rientjes
  1 sibling, 0 replies; 58+ messages in thread
From: David Rientjes @ 2011-03-18 20:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, linux-mm

On Thu, 17 Mar 2011, Andrew Morton wrote:

> commit 8bc719d3cab8414938f9ea6e33b58d8810d18068
> Author:     Martin Schwidefsky <schwidefsky@de.ibm.com>
> AuthorDate: Mon Sep 25 23:31:20 2006 -0700
> Commit:     Linus Torvalds <torvalds@g5.osdl.org>
> CommitDate: Tue Sep 26 08:48:47 2006 -0700
> 
>     [PATCH] out of memory notifier
>     
>     Add a notifer chain to the out of memory killer.  If one of the registered
>     callbacks could release some memory, do not kill the process but return and
>     retry the allocation that forced the oom killer to run.
>     
>     The purpose of the notifier is to add a safety net in the presence of
>     memory ballooners.  If the resource manager inflated the balloon to a size
>     where memory allocations can not be satisfied anymore, it is better to
>     deflate the balloon a bit instead of killing processes.
>     
>     The implementation for the s390 ballooner is included.
> 

I think it would be safe to do this only for CONSTRAINT_NONE in 
out_of_memory() since it's definitely not the right thing to do when a 
cpuset or mempolicy is oom; there's no guarantee that the freed memory is 
allocatable by the oom task.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  9:21           ` David Rientjes
@ 2010-12-27  1:47             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-27  1:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm

On Wed, 22 Dec 2010 01:21:01 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote:
> 
> > For example, oom_check_deadlockd could work as
> > 
> >   1. disable oom by memory.oom_disable=1
> >   2. check memory.oom_notify and wait on it by poll()
> >   3. At oom, it wakes up.
> >   4. wait for 60 secs.
> >   5. If the cgroup is still in OOM, set oom_disable=0
> > 
> > This daemon will not use much memory and can run in the /root memory cgroup.
> > 
> 
> Yes, this is almost the same as the "simple and perfect implementation" 
> that I alluded to in my response to Andrew (and I think KOSAKI-san 
> suggested something similar), although it doesn't quite work because all 
> threads in the cgroup are sitting on the waitqueue and don't get woken up 
> to see oom_control == 0 unless memory is freed, a task is moved, or the 
> limit is resized so this daemon will need to trigger that as step #6.
> 
> That certainly works if it is indeed perfect and guaranteed to always be 
> running.  In the interest of a robust resource isolation model, I don't 
> think we can ever make that conclusion, though, so this discussion is 
> really only about how fault tolerant the kernel is because the end result 
> is if this daemon fails, the kernel livelocks.
> 
> I'd personally prefer not to allow a buggy or imperfect userspace to allow 
> the kernel to livelock; we control the kernel so I think it would be best 
> to ensure that it cannot livelock no matter what userspace happens to do 
> despite its best effort.  If you or Andrew come to the conclusion that 
> it's overkill and at the end of the day we have to trust userspace, I 
> really can't argue that philosophy though :)
> 

It's not a livelock. A user can create a new thread in a cgroup that is not
under OOM. IMHO, oom-kill itself is totally of no use, and panic_on_oom or
stopping the system is always better. We can do cluster keep-alive.

If I were you, I'd add a "pool of memory for an emergency cgroup" and run
watchdog tasks in it.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-25 10:47 ` Balbir Singh
@ 2010-12-26 20:35   ` David Rientjes
  0 siblings, 0 replies; 58+ messages in thread
From: David Rientjes @ 2010-12-26 20:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Divyesh Shah, linux-mm

On Sat, 25 Dec 2010, Balbir Singh wrote:

> > Completely disabling the oom killer for a memcg is problematic if
> > userspace is unable to address the condition itself, usually because
> > userspace is unresponsive.  This scenario creates a memcg livelock:
> > tasks are continuously trying to allocate memory and nothing is getting
> > killed, so memory freeing is impossible since reclaim has failed, and
> > all work stalls with no remedy in sight.
> > 
> > This patch adds an oom killer delay so that a memcg may be configured to
> > wait at least a pre-defined number of milliseconds before calling the
> > oom killer.  If the oom condition persists for this number of
> > milliseconds, the oom killer will be called the next time the memory
> > controller attempts to charge a page (and memory.oom_control is set to
> > 0).  This allows userspace to have a short period of time to respond to
> > the condition before timing out and deferring to the kernel to kill a
> > task.
> > 
> > Admins may set the oom killer timeout using the new interface:
> > 
> > 	# echo 60000 > memory.oom_delay
> > 
> > This will defer oom killing to the kernel only after 60 seconds has
> > elapsed.  When setting memory.oom_delay, all pending timeouts are
> > restarted.
> 
> I think Paul mentioned this problem and solution (I think it is already in
> use at Google) some time back.  Is the x milliseconds a per-oom-kill
> decision timer, or is it global?
> 

It's global for that memcg and works for all oom kills in that memcg until 
changed by userspace.  The example given in the changelog of 60 seconds 
would be for a use-case where we're only concerned about deadlock because 
nothing can allocate memory, the oom killer is disabled, and userspace 
fails to act for whatever reason.  It's necessary because of how memcg 
handles oom disabling compared to OOM_DISABLE: if all tasks are OOM_DISABLE 
for a system-wide oom then the machine panics, whereas memcg will simply 
deadlock when memory.oom_control == 0.

 [ Your email is in response to the first version of the patch that 
   doesn't have the changes requested by Andrew, but this response will 
   be in terms of how it is implemented in v2. ]

> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  Documentation/cgroups/memory.txt |   15 ++++++++++
> >  mm/memcontrol.c                  |   56 +++++++++++++++++++++++++++++++++----
> >  2 files changed, 65 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -68,6 +68,7 @@ Brief summary of control files.
> >  				 (See sysctl's vm.swappiness)
> >   memory.move_charge_at_immigrate # set/show controls of moving charges
> >   memory.oom_control		 # set/show oom controls.
> > + memory.oom_delay		 # set/show millisecs to wait before oom kill
> > 
> >  1. History
> > 
> > @@ -640,6 +641,20 @@ At reading, current status of OOM is shown.
> >  	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
> >  				 be stopped.)
> > 
> > +It is also possible to configure an oom killer timeout to prevent the
> > +possibility that the memcg will livelock looking for memory if userspace
> > +has disabled the oom killer with oom_control but cannot act to fix the
> > +condition itself (usually because userspace has become unresponsive).
> > +
> > +To set an oom killer timeout for a memcg, write the number of milliseconds
> > +to wait before killing a task to memory.oom_delay:
> > +
> > +	# echo 60000 > memory.oom_delay	# wait 60 seconds, then oom kill
> > +
> > +This timeout is reset the next time the memcg successfully charges memory
> > +to a task.
> > +
> > +
> >  11. TODO
> > 
> >  1. Add support for accounting huge pages (as a separate controller)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -233,12 +233,16 @@ struct mem_cgroup {
> >  	 * Should the accounting and control be hierarchical, per subtree?
> >  	 */
> >  	bool use_hierarchy;
> > +	/* oom_delay has expired and still out of memory? */
> > +	bool oom_kill_delay_expired;
> >  	atomic_t	oom_lock;
> >  	atomic_t	refcnt;
> > 
> >  	unsigned int	swappiness;
> >  	/* OOM-Killer disable */
> >  	int		oom_kill_disable;
> > +	/* min number of ticks to wait before calling oom killer */
> > +	int		oom_kill_delay;
> > 
> >  	/* set when res.limit == memsw.limit */
> >  	bool		memsw_is_minimum;
> > @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
> > 
> >  static void memcg_oom_recover(struct mem_cgroup *mem)
> >  {
> > +	mem->oom_kill_delay_expired = false;
> >  	if (mem && atomic_read(&mem->oom_lock))
> >  		memcg_wakeup_oom(mem);
> >  }
> > @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem)
> >  /*
> >   * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> >   */
> > -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> > +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> >  {
> >  	struct oom_wait_info owait;
> > -	bool locked, need_to_kill;
> > +	bool locked;
> > +	bool need_to_kill = true;
> > +	bool need_to_delay = false;
> > 
> >  	owait.mem = mem;
> >  	owait.wait.flags = 0;
> >  	owait.wait.func = memcg_oom_wake_function;
> >  	owait.wait.private = current;
> >  	INIT_LIST_HEAD(&owait.wait.task_list);
> > -	need_to_kill = true;
> >  	/* At first, try to OOM lock hierarchy under mem.*/
> >  	mutex_lock(&memcg_oom_mutex);
> >  	locked = mem_cgroup_oom_lock(mem);
> > @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> >  	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> >  	if (!locked || mem->oom_kill_disable)
> >  		need_to_kill = false;
> > -	if (locked)
> > +	if (locked) {
> >  		mem_cgroup_oom_notify(mem);
> > +		if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) {
> > +			need_to_kill = false;
> > +			need_to_delay = true;
> > +		}
> > +	}
> >  	mutex_unlock(&memcg_oom_mutex);
> > 
> >  	if (need_to_kill) {
> >  		finish_wait(&memcg_oom_waitq, &owait.wait);
> >  		mem_cgroup_out_of_memory(mem, mask);
> >  	} else {
> > -		schedule();
> > +		schedule_timeout(need_to_delay ? mem->oom_kill_delay :
> > +						 MAX_SCHEDULE_TIMEOUT);
> >  		finish_wait(&memcg_oom_waitq, &owait.wait);
> >  	}
> >  	mutex_lock(&memcg_oom_mutex);
> >  	mem_cgroup_oom_unlock(mem);
> >  	memcg_wakeup_oom(mem);
> > +	mem->oom_kill_delay_expired = need_to_delay;
> >  	mutex_unlock(&memcg_oom_mutex);
> > 
> >  	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> >  		return false;
> >  	/* Give chance to dying process */
> > -	schedule_timeout(1);
> > +	if (!need_to_delay)
> > +		schedule_timeout(1);
> >  	return true;
> >  }
> 
> I think we need additional statistics for tracking oom kills due to
> timer expiry.
> 

I can add an export for how many times memory.oom_delay_millisecs expired 
and allowed an oom kill if you'd like.  If you have a special use case for 
that, please let me know and I'll put it in the changelog.

> > @@ -2007,6 +2021,7 @@ again:
> >  		refill_stock(mem, csize - PAGE_SIZE);
> >  	css_put(&mem->css);
> >  done:
> > +	mem->oom_kill_delay_expired = false;
> >  	*memcg = mem;
> >  	return 0;
> >  nomem:
> > @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
> >  	return 0;
> >  }
> > 
> > +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft)
> > +{
> > +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +
> > +	return jiffies_to_msecs(memcg->oom_kill_delay);
> > +}
> > +
> > +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft,
> > +				      u64 val)
> > +{
> > +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +
> > +	/* Sanity check -- don't wait longer than an hour */
> > +	if (val > (60 * 60 * 1000))
> > +		return -EINVAL;
> 
> Why do this and not document it?  These sorts of things get extremely
> confusing.  I would prefer not to have it without a reasonable use case
> or very good documentation explaining why we need an upper bound.
> 

An upper bound is necessary, although not necessarily at 60 minutes, because 
mem_cgroup_oom_delay_write() takes a u64 and memcg->oom_delay_millisecs is 
an int.  I don't think it makes sense for a memcg to deadlock and fail all 
memory allocations for over an hour, so it's a fairly arbitrary value.  
I'll document it in the change to Documentation/cgroups/memory.txt.

Thanks for the review!


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  7:27 David Rientjes
  2010-12-22  7:59 ` Andrew Morton
@ 2010-12-25 10:47 ` Balbir Singh
  2010-12-26 20:35   ` David Rientjes
  1 sibling, 1 reply; 58+ messages in thread
From: Balbir Singh @ 2010-12-25 10:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Divyesh Shah, linux-mm

* David Rientjes <rientjes@google.com> [2010-12-21 23:27:25]:

> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because
> userspace is unresponsive.  This scenario creates a memcg livelock:
> tasks are continuously trying to allocate memory and nothing is getting
> killed, so memory freeing is impossible since reclaim has failed, and
> all work stalls with no remedy in sight.
> 
> This patch adds an oom killer delay so that a memcg may be configured to
> wait at least a pre-defined number of milliseconds before calling the
> oom killer.  If the oom condition persists for this number of
> milliseconds, the oom killer will be called the next time the memory
> controller attempts to charge a page (and memory.oom_control is set to
> 0).  This allows userspace to have a short period of time to respond to
> the condition before timing out and deferring to the kernel to kill a
> task.
> 
> Admins may set the oom killer timeout using the new interface:
> 
> 	# echo 60000 > memory.oom_delay
> 
> This will defer oom killing to the kernel only after 60 seconds has
> elapsed.  When setting memory.oom_delay, all pending timeouts are
> restarted.

I think Paul mentioned this problem and solution (I think it is already in
use at Google) some time back.  Is the x milliseconds a per-oom-kill
decision timer, or is it global?

> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/cgroups/memory.txt |   15 ++++++++++
>  mm/memcontrol.c                  |   56 +++++++++++++++++++++++++++++++++----
>  2 files changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -68,6 +68,7 @@ Brief summary of control files.
>  				 (See sysctl's vm.swappiness)
>   memory.move_charge_at_immigrate # set/show controls of moving charges
>   memory.oom_control		 # set/show oom controls.
> + memory.oom_delay		 # set/show millisecs to wait before oom kill
> 
>  1. History
> 
> @@ -640,6 +641,20 @@ At reading, current status of OOM is shown.
>  	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
>  				 be stopped.)
> 
> +It is also possible to configure an oom killer timeout to prevent the
> +possibility that the memcg will livelock looking for memory if userspace
> +has disabled the oom killer with oom_control but cannot act to fix the
> +condition itself (usually because userspace has become unresponsive).
> +
> +To set an oom killer timeout for a memcg, write the number of milliseconds
> +to wait before killing a task to memory.oom_delay:
> +
> +	# echo 60000 > memory.oom_delay	# wait 60 seconds, then oom kill
> +
> +This timeout is reset the next time the memcg successfully charges memory
> +to a task.
> +
> +
>  11. TODO
> 
>  1. Add support for accounting huge pages (as a separate controller)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -233,12 +233,16 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> +	/* oom_delay has expired and still out of memory? */
> +	bool oom_kill_delay_expired;
>  	atomic_t	oom_lock;
>  	atomic_t	refcnt;
> 
>  	unsigned int	swappiness;
>  	/* OOM-Killer disable */
>  	int		oom_kill_disable;
> +	/* min number of ticks to wait before calling oom killer */
> +	int		oom_kill_delay;
> 
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> @@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
> 
>  static void memcg_oom_recover(struct mem_cgroup *mem)
>  {
> +	mem->oom_kill_delay_expired = false;
>  	if (mem && atomic_read(&mem->oom_lock))
>  		memcg_wakeup_oom(mem);
>  }
> @@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem)
>  /*
>   * try to call OOM killer. returns false if we should exit memory-reclaim loop.
>   */
> -bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
> +static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  {
>  	struct oom_wait_info owait;
> -	bool locked, need_to_kill;
> +	bool locked;
> +	bool need_to_kill = true;
> +	bool need_to_delay = false;
> 
>  	owait.mem = mem;
>  	owait.wait.flags = 0;
>  	owait.wait.func = memcg_oom_wake_function;
>  	owait.wait.private = current;
>  	INIT_LIST_HEAD(&owait.wait.task_list);
> -	need_to_kill = true;
>  	/* At first, try to OOM lock hierarchy under mem.*/
>  	mutex_lock(&memcg_oom_mutex);
>  	locked = mem_cgroup_oom_lock(mem);
> @@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
>  	if (!locked || mem->oom_kill_disable)
>  		need_to_kill = false;
> -	if (locked)
> +	if (locked) {
>  		mem_cgroup_oom_notify(mem);
> +		if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) {
> +			need_to_kill = false;
> +			need_to_delay = true;
> +		}
> +	}
>  	mutex_unlock(&memcg_oom_mutex);
> 
>  	if (need_to_kill) {
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(mem, mask);
>  	} else {
> -		schedule();
> +		schedule_timeout(need_to_delay ? mem->oom_kill_delay :
> +						 MAX_SCHEDULE_TIMEOUT);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  	}
>  	mutex_lock(&memcg_oom_mutex);
>  	mem_cgroup_oom_unlock(mem);
>  	memcg_wakeup_oom(mem);
> +	mem->oom_kill_delay_expired = need_to_delay;
>  	mutex_unlock(&memcg_oom_mutex);
> 
>  	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
>  		return false;
>  	/* Give chance to dying process */
> -	schedule_timeout(1);
> +	if (!need_to_delay)
> +		schedule_timeout(1);
>  	return true;
>  }

I think we need additional statistics for tracking oom kills due to
timer expiry.


> 
> @@ -2007,6 +2021,7 @@ again:
>  		refill_stock(mem, csize - PAGE_SIZE);
>  	css_put(&mem->css);
>  done:
> +	mem->oom_kill_delay_expired = false;
>  	*memcg = mem;
>  	return 0;
>  nomem:
> @@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>  	return 0;
>  }
> 
> +static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	return jiffies_to_msecs(memcg->oom_kill_delay);
> +}
> +
> +static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft,
> +				      u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	/* Sanity check -- don't wait longer than an hour */
> +	if (val > (60 * 60 * 1000))
> +		return -EINVAL;

Why do this and not document it?  These sorts of things get extremely
confusing.  I would prefer not to have it without a reasonable use case
or very good documentation explaining why we need an upper bound.

> +
> +	cgroup_lock();
> +	memcg->oom_kill_delay = msecs_to_jiffies(val);
> +	memcg_oom_recover(memcg);
> +	cgroup_unlock();
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -4116,6 +4154,11 @@ static struct cftype mem_cgroup_files[] = {
>  		.unregister_event = mem_cgroup_oom_unregister_event,
>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>  	},
> +	{
> +		.name = "oom_delay",
> +		.read_u64 = mem_cgroup_oom_delay_read,
> +		.write_u64 = mem_cgroup_oom_delay_write,
> +	},
>  };
> 
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -4357,6 +4400,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		mem->use_hierarchy = parent->use_hierarchy;
>  		mem->oom_kill_disable = parent->oom_kill_disable;
> +		mem->oom_kill_delay = parent->oom_kill_delay;
>  	}
> 
>  	if (parent && parent->use_hierarchy) {

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  8:55         ` KAMEZAWA Hiroyuki
@ 2010-12-22  9:21           ` David Rientjes
  2010-12-27  1:47             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 58+ messages in thread
From: David Rientjes @ 2010-12-22  9:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm

On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote:

> For example, oom_check_deadlockd could work as
> 
>   1. disable oom by memory.oom_disable=1
>   2. check memory.oom_notify and wait on it by poll()
>   3. At oom, it wakes up.
>   4. wait for 60 secs.
>   5. If the cgroup is still in OOM, set oom_disable=0
> 
> This daemon will not use much memory and can run in the /root memory cgroup.
> 

Yes, this is almost the same as the "simple and perfect implementation" 
that I alluded to in my response to Andrew (and I think KOSAKI-san 
suggested something similar), although it doesn't quite work as is: all 
threads in the cgroup are sitting on the waitqueue and don't get woken up 
to see oom_control == 0 unless memory is freed, a task is moved, or the 
limit is resized, so this daemon would need to trigger one of those as 
step #6.

That certainly works if it is indeed perfect and guaranteed to always be 
running.  In the interest of a robust resource isolation model, I don't 
think we can ever draw that conclusion, though, so this discussion is 
really only about how fault tolerant the kernel is: the end result of 
this daemon failing is that the kernel livelocks.

I'd personally prefer not to let a buggy or imperfect userspace cause the 
kernel to livelock; we control the kernel, so I think it would be best 
to ensure that it cannot livelock no matter what userspace happens to do 
despite its best effort.  If you or Andrew conclude that this is overkill 
and that at the end of the day we have to trust userspace, I really can't 
argue with that philosophy, though :)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  8:48       ` KAMEZAWA Hiroyuki
  2010-12-22  8:55         ` KAMEZAWA Hiroyuki
@ 2010-12-22  9:04         ` David Rientjes
  1 sibling, 0 replies; 58+ messages in thread
From: David Rientjes @ 2010-12-22  9:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm

On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote:

> > > Seems to be hard to use. No one can estimate "milliseconds" for avoiding
> > > OOM-kill. I think this is very bad. Nack to this feature itself.
> > > 
> > 
> > There's no estimation that is really needed, we simply need to be able to 
> > stall long enough that we'll eventually kill "something" if userspace 
> > fails to act.
> > 
> 
> Why do we have to think about usermode failure due to misconfiguration or a
> usermode bug?  That is usually the job of middleware.
> Please make libcgroup or libvirt more useful.
> 

It's a general concern for users who wish to defer the kernel oom killer 
unless userspace chooses not to act or cannot act, and the only way to do 
that without memory.oom_delay is to set all memcgs to have 
memory.oom_control of 1.  memory.oom_control of 1 is equivalent to 
OOM_DISABLE for all attached tasks, and if all tasks are assigned to 
non-root memcgs for resource isolation (and the sum of those memcgs' 
limits equals system RAM), we always get memcg oom kills instead of 
system-wide oom kills.  The difference in this case is that with the 
memcg oom kills, the kernel livelocks, whereas the system-wide oom kills 
would panic the machine since all eligible tasks are OOM_DISABLE, the 
equivalent of all memcgs having memory.oom_control of 1.

Since the kernel has opened this possibility up by disabling oom killing 
without giving userspace any other chance of deferring the oom killer, we 
need a way to preserve the machine by having a fallback plan if userspace 
cannot act.  The other possibility would be to panic if all memcgs have 
memory.oom_control of 1 and the sum of their limits equals the machine's 
memory capacity.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  8:48       ` KAMEZAWA Hiroyuki
@ 2010-12-22  8:55         ` KAMEZAWA Hiroyuki
  2010-12-22  9:21           ` David Rientjes
  2010-12-22  9:04         ` David Rientjes
  1 sibling, 1 reply; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-22  8:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Balbir Singh, Daisuke Nishimura,
	Divyesh Shah, linux-mm

On Wed, 22 Dec 2010 17:48:29 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 22 Dec 2010 00:48:53 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote:
> > 
> > > Seems to be hard to use. No one can estimate "milliseconds" for avoiding
> > > OOM-kill. I think this is very bad. Nack to this feature itself.
> > > 
> > 
> > There's no estimation that is really needed, we simply need to be able to 
> > stall long enough that we'll eventually kill "something" if userspace 
> > fails to act.
> > 
> 
> Why do we have to think about usermode failure due to misconfiguration or a
> usermode bug?  That is usually the job of middleware.

For example, oom_check_deadlockd can work as:

  1. disable oom by memory.oom_disable=1
  2. check memory.oom_notify and wait on it by poll()
  3. At oom, it wakes up.
  4. wait for 60 secs.
  5. If the cgroup is still in OOM, set oom_disable=0

This daemon will not use much memory and can run in the /root memory cgroup.
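In cgroup v1 the oom notification in step 2 is delivered through an eventfd registered via cgroup.event_control, and re-enabling the killer in step 5 is a write of "0" to memory.oom_control. A minimal sketch of such a daemon, with error handling omitted; the 60-second grace period is the thread's example, while the function names and buffer sizes are illustrative:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Return true if memory.oom_control contents report "under_oom 1". */
static bool under_oom(const char *oom_control_text)
{
	const char *p = strstr(oom_control_text, "under_oom ");
	return p && p[strlen("under_oom ")] == '1';
}

/* Watchdog loop for one cgroup v1 memory cgroup directory (sketch). */
void oom_watchdog(const char *cgdir)
{
	char path[512], buf[256];
	int efd = eventfd(0, 0);

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgdir);
	int ofd = open(path, O_RDWR);

	/* Register the eventfd for oom notification: "<eventfd> <fd>". */
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgdir);
	int cfd = open(path, O_WRONLY);
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	write(cfd, buf, strlen(buf));

	for (;;) {
		uint64_t events;
		ssize_t n;

		read(efd, &events, sizeof(events));	/* blocks until oom */
		sleep(60);				/* grace period */

		n = pread(ofd, buf, sizeof(buf) - 1, 0);
		if (n > 0) {
			buf[n] = '\0';
			if (under_oom(buf))		/* step 5 */
				write(ofd, "0", 1);	/* re-enable oom killer */
		}
	}
}
```

Note that, as discussed elsewhere in the thread, re-enabling the killer alone does not wake the waiters; something must also retry the charge (free memory, move a task, or resize the limit).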

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  8:17   ` KAMEZAWA Hiroyuki
  2010-12-22  8:31     ` KOSAKI Motohiro
@ 2010-12-22  8:48     ` David Rientjes
  2010-12-22  8:48       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 58+ messages in thread
From: David Rientjes @ 2010-12-22  8:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm

On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote:

> Seems to be hard to use. No one can estimate "milliseconds" for avoiding
> OOM-kill. I think this is very bad. Nack to this feature itself.
> 

There's no estimation that is really needed, we simply need to be able to 
stall long enough that we'll eventually kill "something" if userspace 
fails to act.

> If you want something smart _in kernel_, please implement the following.
> 
>  - When oom is hit, enlarge the limit to some extent.
>  - All processes in the cgroup should be stopped.
>  - A helper application will be called by usermode_helper().
>  - When the helper application exit()s, automatically release all processes
>    to run again.
> 

Hmm, that's a _lot_ of policy to be implemented in the kernel itself, and 
it comes at the cost of either being faulty (if the limit cannot be 
increased) or harmful (when increasing the limit is detrimental to other 
memcgs).


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  8:48     ` David Rientjes
@ 2010-12-22  8:48       ` KAMEZAWA Hiroyuki
  2010-12-22  8:55         ` KAMEZAWA Hiroyuki
  2010-12-22  9:04         ` David Rientjes
  0 siblings, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-22  8:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm

On Wed, 22 Dec 2010 00:48:53 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 22 Dec 2010, KAMEZAWA Hiroyuki wrote:
> 
> > Seems to be hard to use. No one can estimate "milliseconds" for avoiding
> > OOM-kill. I think this is very bad. Nack to this feature itself.
> > 
> 
> There's no estimation that is really needed, we simply need to be able to 
> stall long enough that we'll eventually kill "something" if userspace 
> fails to act.
> 

Why do we have to think about usermode failure due to misconfiguration or a
usermode bug?  That is usually the job of middleware.
Please make libcgroup or libvirt more useful.

> > If you want something smart _in kernel_, please implement the following.
> > 
> >  - When oom is hit, enlarge the limit to some extent.
> >  - All processes in the cgroup should be stopped.
> >  - A helper application will be called by usermode_helper().
> >  - When the helper application exit()s, automatically release all processes
> >    to run again.
> > 
> 
> Hmm, that's a _lot_ of policy to be implemented in the kernel itself, and 
> it comes at the cost of either being faulty (if the limit cannot be 
> increased) or harmful (when increasing the limit is detrimental to other 
> memcgs).
> 

Or running a helper in the "root" cgroup, which has no limit at all.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  7:59 ` Andrew Morton
  2010-12-22  8:17   ` KAMEZAWA Hiroyuki
@ 2010-12-22  8:42   ` David Rientjes
  1 sibling, 0 replies; 58+ messages in thread
From: David Rientjes @ 2010-12-22  8:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah,
	linux-mm

On Tue, 21 Dec 2010, Andrew Morton wrote:

> > Completely disabling the oom killer for a memcg is problematic if
> > userspace is unable to address the condition itself, usually because
> > userspace is unresponsive.  This scenario creates a memcg livelock:
> > tasks are continuously trying to allocate memory and nothing is getting
> > killed, so memory freeing is impossible since reclaim has failed, and
> > all work stalls with no remedy in sight.
> 
> Userspace was buggy, surely.  If userspace has elected to disable the
> oom-killer then it should ensure that it can cope with the ensuing result.
> 

I think it would be argued that no such guarantee can ever be made.

> One approach might be to run a mlockall()ed watchdog which monitors the
> worker tasks via shared memory.  Another approach would be to run that
> watchdog in a different memcg, without mlockall().  There are surely
> plenty of other ways of doing it.
> 

Yeah, we considered a simple and perfect userspace implementation that 
would be just as fault tolerant unless it ends up getting killed (not by 
the oom killer) or dies itself, but there was a concern that disabling 
the oom killer in every memcg could render the entire kernel useless 
without the help of userspace, and that is a bad policy.

In our particular use case, we _always_ want to defer using the kernel oom 
killer unless userspace chooses not to act (because the limit is already 
high enough) or cannot act (because of a bug).  The former is accomplished 
by setting memory.oom_control to 1 originally and then setting it to 0 for 
that particular memcg to allow the oom kill, but it is not possible for 
the latter.

> Minutiae:
> 
> - changelog and docs forgot to mention that oom_delay=0 disables.
> 

I thought it would be intuitive that an oom_delay of 0 would mean there 
was no delay :)

> - it's called oom_kill_delay in the kernel and oom_delay in userspace.
> 

Right, this was because of the symmetry to the oom_kill_disable naming in 
the struct itself.  I'd be happy to change it if we're to go ahead in this 
direction.

> - oom_delay_millisecs would be a better name for the pseudo file.
> 

Agreed.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  8:17   ` KAMEZAWA Hiroyuki
@ 2010-12-22  8:31     ` KOSAKI Motohiro
  2010-12-22  8:48     ` David Rientjes
  1 sibling, 0 replies; 58+ messages in thread
From: KOSAKI Motohiro @ 2010-12-22  8:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Andrew Morton, David Rientjes, Balbir Singh,
	Daisuke Nishimura, Divyesh Shah, linux-mm

> Seems to be hard to use. No one can estimate "milliseconds" for avoiding
> OOM-kill. I think this is very bad. Nack to this feature itself.
> 
> 
> If you want something smart _in kernel_, please implement the following.
> 
>  - When oom is hit, enlarge the limit to some extent.
>  - All processes in the cgroup should be stopped.
>  - A helper application will be called by usermode_helper().
>  - When the helper application exit()s, automatically release all processes
>    to run again.
> 
> Then, you can avoid the oom-kill situation automatically with the kernel's help.

I bet a monitor using a different memcg is the simplest thing.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  7:59 ` Andrew Morton
@ 2010-12-22  8:17   ` KAMEZAWA Hiroyuki
  2010-12-22  8:31     ` KOSAKI Motohiro
  2010-12-22  8:48     ` David Rientjes
  2010-12-22  8:42   ` David Rientjes
  1 sibling, 2 replies; 58+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-22  8:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Balbir Singh, Daisuke Nishimura, Divyesh Shah, linux-mm

On Tue, 21 Dec 2010 23:59:24 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 21 Dec 2010 23:27:25 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> 
> > Completely disabling the oom killer for a memcg is problematic if
> > userspace is unable to address the condition itself, usually because
> > userspace is unresponsive.  This scenario creates a memcg livelock:
> > tasks are continuously trying to allocate memory and nothing is getting
> > killed, so memory freeing is impossible since reclaim has failed, and
> > all work stalls with no remedy in sight.
> 
> Userspace was buggy, surely.  If userspace has elected to disable the
> oom-killer then it should ensure that it can cope with the ensuing result.
> 
> One approach might be to run a mlockall()ed watchdog which monitors the
> worker tasks via shared memory.  Another approach would be to run that
> watchdog in a different memcg, without mlockall().  There are surely
> plenty of other ways of doing it.
> 
> > This patch adds an oom killer delay so that a memcg may be configured to
> > wait at least a pre-defined number of milliseconds before calling the
> > oom killer.  If the oom condition persists for this number of
> > milliseconds, the oom killer will be called the next time the memory
> > controller attempts to charge a page (and memory.oom_control is set to
> > 0).  This allows userspace to have a short period of time to respond to
> > the condition before timing out and deferring to the kernel to kill a
> > task.
> > 
> > Admins may set the oom killer timeout using the new interface:
> > 
> > 	# echo 60000 > memory.oom_delay
> > 
> > This will defer oom killing to the kernel only after 60 seconds has
> > elapsed.  When setting memory.oom_delay, all pending timeouts are
> > restarted.
> > 
> 
> eww, ick ick ick.
> 
> 
> Minutiae:
> 
> - changelog and docs forgot to mention that oom_delay=0 disables.
> 
> - it's called oom_kill_delay in the kernel and oom_delay in userspace.
> 
> - oom_delay_millisecs would be a better name for the pseudo file.
> 
> - Also, ick.
> 

Seems to be hard to use. No one can estimate "milliseconds" for avoiding
OOM-kill. I think this is very bad. Nack to this feature itself.


If you want something smart _in kernel_, please implement the following.

 - When oom is hit, enlarge the limit to some extent.
 - All processes in the cgroup should be stopped.
 - A helper application will be called by usermode_helper().
 - When the helper application exit()s, automatically release all processes
   to run again.

Then, you can avoid the oom-kill situation automatically with the kernel's help.

BTW, don't call cgroup_lock(). It's always dangerous. You can add your own
lock.


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch] memcg: add oom killer delay
  2010-12-22  7:27 David Rientjes
@ 2010-12-22  7:59 ` Andrew Morton
  2010-12-22  8:17   ` KAMEZAWA Hiroyuki
  2010-12-22  8:42   ` David Rientjes
  2010-12-25 10:47 ` Balbir Singh
  1 sibling, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2010-12-22  7:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah,
	linux-mm

On Tue, 21 Dec 2010 23:27:25 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because
> userspace is unresponsive.  This scenario creates a memcg livelock:
> tasks are continuously trying to allocate memory and nothing is getting
> killed, so memory freeing is impossible since reclaim has failed, and
> all work stalls with no remedy in sight.

Userspace was buggy, surely.  If userspace has elected to disable the
oom-killer then it should ensure that it can cope with the ensuing result.

One approach might be to run a mlockall()ed watchdog which monitors the
worker tasks via shared memory.  Another approach would be to run that
watchdog in a different memcg, without mlockall().  There are surely
plenty of other ways of doing it.
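A rough sketch of the first approach; the heartbeat protocol, the 10-second staleness threshold, and the recovery action are all hypothetical, not something specified in this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* True if the workers' heartbeat has not advanced within `threshold` seconds. */
static bool heartbeat_stale(time_t last_beat, time_t now, time_t threshold)
{
	return now - last_beat > threshold;
}

/* Watchdog side: lock all current and future pages so the watchdog itself
 * keeps running under the memcg's memory pressure, then poll a heartbeat
 * word that the workers update through shared memory. */
void watchdog(volatile time_t *heartbeat)
{
	mlockall(MCL_CURRENT | MCL_FUTURE);
	for (;;) {
		if (heartbeat_stale(*heartbeat, time(NULL), 10)) {
			/* Workers appear wedged: recover here, e.g. free
			 * memory, resize the limit, or re-enable the memcg
			 * oom killer via memory.oom_control. */
		}
		sleep(1);
	}
}
```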

> This patch adds an oom killer delay so that a memcg may be configured to
> wait at least a pre-defined number of milliseconds before calling the
> oom killer.  If the oom condition persists for this number of
> milliseconds, the oom killer will be called the next time the memory
> controller attempts to charge a page (and memory.oom_control is set to
> 0).  This allows userspace to have a short period of time to respond to
> the condition before timing out and deferring to the kernel to kill a
> task.
> 
> Admins may set the oom killer timeout using the new interface:
> 
> 	# echo 60000 > memory.oom_delay
> 
> This will defer oom killing to the kernel only after 60 seconds has
> elapsed.  When setting memory.oom_delay, all pending timeouts are
> restarted.
> 

eww, ick ick ick.


Minutiae:

- changelog and docs forgot to mention that oom_delay=0 disables.

- it's called oom_kill_delay in the kernel and oom_delay in userspace.

- oom_delay_millisecs would be a better name for the pseudo file.

- Also, ick.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch] memcg: add oom killer delay
@ 2010-12-22  7:27 David Rientjes
  2010-12-22  7:59 ` Andrew Morton
  2010-12-25 10:47 ` Balbir Singh
  0 siblings, 2 replies; 58+ messages in thread
From: David Rientjes @ 2010-12-22  7:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki, Divyesh Shah,
	linux-mm

Completely disabling the oom killer for a memcg is problematic if
userspace is unable to address the condition itself, usually because
userspace is unresponsive.  This scenario creates a memcg livelock:
tasks are continuously trying to allocate memory and nothing is getting
killed, so memory freeing is impossible since reclaim has failed, and
all work stalls with no remedy in sight.

This patch adds an oom killer delay so that a memcg may be configured to
wait at least a pre-defined number of milliseconds before calling the
oom killer.  If the oom condition persists for this number of
milliseconds, the oom killer will be called the next time the memory
controller attempts to charge a page (and memory.oom_control is set to
0).  This allows userspace to have a short period of time to respond to
the condition before timing out and deferring to the kernel to kill a
task.

Admins may set the oom killer timeout using the new interface:

	# echo 60000 > memory.oom_delay

This will defer oom killing to the kernel only after 60 seconds has
elapsed.  When setting memory.oom_delay, all pending timeouts are
restarted.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroups/memory.txt |   15 ++++++++++
 mm/memcontrol.c                  |   56 +++++++++++++++++++++++++++++++++----
 2 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -68,6 +68,7 @@ Brief summary of control files.
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
  memory.oom_control		 # set/show oom controls.
+ memory.oom_delay		 # set/show millisecs to wait before oom kill
 
 1. History
 
@@ -640,6 +641,20 @@ At reading, current status of OOM is shown.
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
 
+It is also possible to configure an oom killer timeout to prevent the
+possibility that the memcg will livelock looking for memory if userspace
+has disabled the oom killer with oom_control but cannot act to fix the
+condition itself (usually because userspace has become unresponsive).
+
+To set an oom killer timeout for a memcg, write the number of milliseconds
+to wait before killing a task to memory.oom_delay:
+
+	# echo 60000 > memory.oom_delay	# wait 60 seconds, then oom kill
+
+This timeout is reset the next time the memcg successfully charges memory
+to a task.
+
+
 11. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -233,12 +233,16 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
+	/* oom_delay has expired and still out of memory? */
+	bool oom_kill_delay_expired;
 	atomic_t	oom_lock;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
+	/* min number of ticks to wait before calling oom killer */
+	int		oom_kill_delay;
 
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
@@ -1524,6 +1528,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
 
 static void memcg_oom_recover(struct mem_cgroup *mem)
 {
+	mem->oom_kill_delay_expired = false;
 	if (mem && atomic_read(&mem->oom_lock))
 		memcg_wakeup_oom(mem);
 }
@@ -1531,17 +1536,18 @@ static void memcg_oom_recover(struct mem_cgroup *mem)
 /*
  * try to call OOM killer. returns false if we should exit memory-reclaim loop.
  */
-bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+static bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 {
 	struct oom_wait_info owait;
-	bool locked, need_to_kill;
+	bool locked;
+	bool need_to_kill = true;
+	bool need_to_delay = false;
 
 	owait.mem = mem;
 	owait.wait.flags = 0;
 	owait.wait.func = memcg_oom_wake_function;
 	owait.wait.private = current;
 	INIT_LIST_HEAD(&owait.wait.task_list);
-	need_to_kill = true;
 	/* At first, try to OOM lock hierarchy under mem.*/
 	mutex_lock(&memcg_oom_mutex);
 	locked = mem_cgroup_oom_lock(mem);
@@ -1553,26 +1559,34 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
 	if (!locked || mem->oom_kill_disable)
 		need_to_kill = false;
-	if (locked)
+	if (locked) {
 		mem_cgroup_oom_notify(mem);
+		if (mem->oom_kill_delay && !mem->oom_kill_delay_expired) {
+			need_to_kill = false;
+			need_to_delay = true;
+		}
+	}
 	mutex_unlock(&memcg_oom_mutex);
 
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(mem, mask);
 	} else {
-		schedule();
+		schedule_timeout(need_to_delay ? mem->oom_kill_delay :
+						 MAX_SCHEDULE_TIMEOUT);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
 	mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
+	mem->oom_kill_delay_expired = need_to_delay;
 	mutex_unlock(&memcg_oom_mutex);
 
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
 		return false;
 	/* Give chance to dying process */
-	schedule_timeout(1);
+	if (!need_to_delay)
+		schedule_timeout(1);
 	return true;
 }
 
@@ -2007,6 +2021,7 @@ again:
 		refill_stock(mem, csize - PAGE_SIZE);
 	css_put(&mem->css);
 done:
+	mem->oom_kill_delay_expired = false;
 	*memcg = mem;
 	return 0;
 nomem:
@@ -4053,6 +4068,29 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }
 
+static u64 mem_cgroup_oom_delay_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	return jiffies_to_msecs(memcg->oom_kill_delay);
+}
+
+static int mem_cgroup_oom_delay_write(struct cgroup *cgrp, struct cftype *cft,
+				      u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	/* Sanity check -- don't wait longer than an hour */
+	if (val > (60 * 60 * 1000))
+		return -EINVAL;
+
+	cgroup_lock();
+	memcg->oom_kill_delay = msecs_to_jiffies(val);
+	memcg_oom_recover(memcg);
+	cgroup_unlock();
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4116,6 +4154,11 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "oom_delay",
+		.read_u64 = mem_cgroup_oom_delay_read,
+		.write_u64 = mem_cgroup_oom_delay_write,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4357,6 +4400,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
 		mem->oom_kill_disable = parent->oom_kill_disable;
+		mem->oom_kill_delay = parent->oom_kill_delay;
 	}
 
 	if (parent && parent->use_hierarchy) {
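Because the write handler stores the delay via msecs_to_jiffies() and the read handler converts back with jiffies_to_msecs(), the value read from memory.oom_delay may be rounded up to whole-tick granularity. A userspace model of that rounding — the HZ=100 value and the round-up conversion are assumptions for illustration, not code from the patch:

```c
#include <assert.h>

#define HZ 100	/* assumed tick rate; real kernels vary (100, 250, 1000, ...) */

/* Model of msecs_to_jiffies(): rounds up to the next whole tick. */
static unsigned long msecs_to_jiffies_model(unsigned int ms)
{
	return (ms + (1000 / HZ) - 1) / (1000 / HZ);
}

/* Model of jiffies_to_msecs() for HZ values that divide 1000 evenly. */
static unsigned int jiffies_to_msecs_model(unsigned long j)
{
	return j * (1000 / HZ);
}
```

Under these assumptions, writing 60000 reads back as 60000 (6000 ticks at HZ=100), while writing 15 would read back as 20.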


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2011-03-18 20:36 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-08  0:24 [patch] memcg: add oom killer delay David Rientjes
2011-02-08  1:55 ` KAMEZAWA Hiroyuki
2011-02-08  2:13   ` David Rientjes
2011-02-08  2:13     ` KAMEZAWA Hiroyuki
2011-02-08  2:20       ` KAMEZAWA Hiroyuki
2011-02-08  2:37         ` David Rientjes
2011-02-08 10:25           ` Balbir Singh
2011-02-09 22:19 ` David Rientjes
2011-02-10  0:04   ` KAMEZAWA Hiroyuki
2011-02-16  3:15     ` David Rientjes
2011-02-20 22:19       ` David Rientjes
2011-02-23 23:08   ` Andrew Morton
2011-02-24  0:13     ` KAMEZAWA Hiroyuki
2011-02-24  0:51     ` David Rientjes
2011-03-03 20:11       ` David Rientjes
2011-03-03 21:52       ` Andrew Morton
2011-03-08  0:12         ` David Rientjes
2011-03-08  0:29           ` Andrew Morton
2011-03-08  0:36             ` David Rientjes
2011-03-08  0:51               ` Andrew Morton
2011-03-08  1:02                 ` David Rientjes
2011-03-08  1:18                   ` Andrew Morton
2011-03-08  1:33                     ` David Rientjes
2011-03-08  2:51                       ` KAMEZAWA Hiroyuki
2011-03-08  3:07                         ` David Rientjes
2011-03-08  3:13                           ` KAMEZAWA Hiroyuki
2011-03-08  3:56                             ` David Rientjes
2011-03-08  4:17                               ` KAMEZAWA Hiroyuki
2011-03-08  5:30                                 ` David Rientjes
2011-03-08  5:49                                   ` KAMEZAWA Hiroyuki
2011-03-08 23:49                                     ` David Rientjes
2011-03-09  6:04                                       ` KAMEZAWA Hiroyuki
2011-03-09  6:44                                         ` David Rientjes
2011-03-09  7:16                                           ` KAMEZAWA Hiroyuki
2011-03-09 21:12                                             ` David Rientjes
2011-03-09 21:27                                               ` [patch] memcg: give current access to memory reserves if it's trying to die David Rientjes
2011-03-09 23:30                                                 ` KAMEZAWA Hiroyuki
2011-03-17 23:37                                                   ` David Rientjes
2011-03-17 23:53                                                 ` Andrew Morton
2011-03-18  4:35                                                   ` KAMEZAWA Hiroyuki
2011-03-18  5:17                                                     ` Andrew Morton
2011-03-18  5:58                                                       ` KAMEZAWA Hiroyuki
2011-03-18 20:36                                                       ` David Rientjes
2011-03-18 20:32                                                   ` David Rientjes
2011-03-08  3:06                     ` [patch] memcg: add oom killer delay KAMEZAWA Hiroyuki
  -- strict thread matches above, loose matches on Subject: below --
2010-12-22  7:27 David Rientjes
2010-12-22  7:59 ` Andrew Morton
2010-12-22  8:17   ` KAMEZAWA Hiroyuki
2010-12-22  8:31     ` KOSAKI Motohiro
2010-12-22  8:48     ` David Rientjes
2010-12-22  8:48       ` KAMEZAWA Hiroyuki
2010-12-22  8:55         ` KAMEZAWA Hiroyuki
2010-12-22  9:21           ` David Rientjes
2010-12-27  1:47             ` KAMEZAWA Hiroyuki
2010-12-22  9:04         ` David Rientjes
2010-12-22  8:42   ` David Rientjes
2010-12-25 10:47 ` Balbir Singh
2010-12-26 20:35   ` David Rientjes
