* [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter
  2011-07-13 12:44 [PATCH 0/2] memcg: oom locking updates Michal Hocko
@ 2011-07-13 11:05 ` Michal Hocko
  2011-07-14  1:02     ` KAMEZAWA Hiroyuki
  2011-07-13 12:32 ` [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2011-07-13 11:05 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

867578cb "memcg: fix oom kill behavior" introduced the oom_lock counter
which is incremented by mem_cgroup_oom_lock when we are about to handle
a memcg OOM situation. mem_cgroup_handle_oom falls back to sleeping if
oom_lock > 1 to prevent multiple oom kills from running at the same time.
The counter is then decremented by mem_cgroup_oom_unlock, called from the
same function.

This works correctly but it can lead to serious starvation when we
have many processes triggering OOM.

Consider a process (call it A) which gets the oom_lock (the first one
that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
other processes are blocked on the mutex.
While A releases the mutex and calls mem_cgroup_out_of_memory, the others
wake up (one after another), increase the counter and fall asleep on
memcg_oom_waitq. Once A finishes mem_cgroup_out_of_memory it takes the
mutex again, decreases oom_lock and wakes the other tasks (if releasing
memory of the killed task hasn't done it already).
The main problem here is that everybody still races for the mutex and
there is no guarantee that the counter drops back to 0 for those that
get back to mem_cgroup_handle_oom. In the end the whole convoy just
increments and decrements the counter without ever reaching the value 1
that would enable killing, so nothing useful happens.
The time spent this way is basically unbounded because it depends heavily
on scheduling and on the ordering of mutex acquisitions.

This patch replaces the counter with simple {un}lock semantics: only the
values 0 and 1 are used to distinguish the two states.
As mem_cgroup_oom_{un}lock works on the whole hierarchy we have to make
sure that we cannot race with somebody else, which is already guaranteed
because both functions are called with the mutex held. All other
consumers just read the value atomically for a single group, which is
sufficient because the value is also set atomically.
The other change is that only the process which took the oom lock will
unlock it once the OOM is handled.
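
For illustration, here is a minimal user-space model of the new 0/1
trylock semantics (a sketch only: the add_unless() helper below merely
mimics what atomic_add_unless(&iter->oom_lock, 1, 1) does in the diff,
it is not the kernel API):

==
#include <stdatomic.h>
#include <stdio.h>

/* Bump *v by a unless it already holds u; report whether we did. */
static int add_unless(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	while (c != u) {
		if (atomic_compare_exchange_weak(v, &c, c + a))
			return 1;	/* moved 0 -> 1, lock taken */
	}
	return 0;			/* already locked, nothing done */
}

int main(void)
{
	atomic_int oom_lock = 0;

	printf("first locker:  %d\n", add_unless(&oom_lock, 1, 1)); /* 1 */
	printf("second locker: %d\n", add_unless(&oom_lock, 1, 1)); /* 0 */
	return 0;
}
==

Contenders that lose simply see the value 1 and back off instead of
pushing a counter higher, which is what breaks the convoy described
above.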

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   24 +++++++++++++++++-------
 1 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..f6c9ead 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
+ * Has to be called with memcg_oom_mutex
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
-	int x, lock_count = 0;
+	int x, lock_count = -1;
 	struct mem_cgroup *iter;
 
 	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
+		if (lock_count == -1)
+			lock_count = x;
+
+		/* New child can be created but we shouldn't race with
+		 * somebody else trying to oom because we are under
+		 * memcg_oom_mutex
+		 */
+		BUG_ON(lock_count != x);
 	}
 
-	if (lock_count == 1)
-		return true;
-	return false;
+	return lock_count;
 }
 
+/*
+ * Has to be called with memcg_oom_mutex
+ */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
 	struct mem_cgroup *iter;
@@ -1916,7 +1925,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock
  2011-07-13 12:44 [PATCH 0/2] memcg: oom locking updates Michal Hocko
  2011-07-13 11:05 ` [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter Michal Hocko
@ 2011-07-13 12:32 ` Michal Hocko
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-13 12:32 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
interface for oom_control. None of the critical sections it protects
sleeps (eventfd_signal works from atomic context and the rest are simple
linked list and oom_lock atomic operations).
A mutex is too heavyweight for those code paths because it causes a lot
of scheduling.
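
As a rough user-space analogy (pthreads below, not the kernel locking
primitives): a contended mutex puts waiters to sleep and wakes them via
the scheduler, while a spinlock just busy-waits, which is the cheaper
choice when the critical section only flips a flag or walks a short
list and never sleeps.

==
#include <pthread.h>

static pthread_spinlock_t oom_lock_analogy;
static int oom_state;	/* toy stand-in for the protected state */

static void flip_oom_state(void)
{
	pthread_spin_lock(&oom_lock_analogy);
	oom_state = !oom_state;	/* tiny section, never sleeps */
	pthread_spin_unlock(&oom_lock_analogy);
}

int main(void)
{
	pthread_spin_init(&oom_lock_analogy, PTHREAD_PROCESS_PRIVATE);
	flip_oom_state();
	pthread_spin_destroy(&oom_lock_analogy);
	return 0;
}
==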

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6c9ead..38f988e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1803,7 +1803,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
- * Has to be called with memcg_oom_mutex
+ * Has to be called with memcg_oom_lock
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
@@ -1817,7 +1817,7 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 
 		/* New child can be created but we shouldn't race with
 		 * somebody else trying to oom because we are under
-		 * memcg_oom_mutex
+		 * memcg_oom_lock
 		 */
 		BUG_ON(lock_count != x);
 	}
@@ -1826,7 +1826,7 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 }
 
 /*
- * Has to be called with memcg_oom_mutex
+ * Has to be called with memcg_oom_lock
  */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
@@ -1843,7 +1843,7 @@ static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 }
 
 
-static DEFINE_MUTEX(memcg_oom_mutex);
+static DEFINE_SPINLOCK(memcg_oom_lock);
 static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
 
 struct oom_wait_info {
@@ -1903,7 +1903,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	INIT_LIST_HEAD(&owait.wait.task_list);
 	need_to_kill = true;
 	/* At first, try to OOM lock hierarchy under mem.*/
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(mem);
 	/*
 	 * Even if signal_pending(), we can't quit charge() loop without
@@ -1915,7 +1915,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		need_to_kill = false;
 	if (locked)
 		mem_cgroup_oom_notify(mem);
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
@@ -1924,11 +1924,11 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		schedule();
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 	if (locked)
 		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
 		return false;
@@ -4588,7 +4588,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	if (!event)
 		return -ENOMEM;
 
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 
 	event->eventfd = eventfd;
 	list_add(&event->list, &memcg->oom_notify);
@@ -4596,7 +4596,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	/* already in OOM ? */
 	if (atomic_read(&memcg->oom_lock))
 		eventfd_signal(eventfd, 1);
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 
 	return 0;
 }
@@ -4610,7 +4610,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 
 	BUG_ON(type != _OOM_TYPE);
 
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 
 	list_for_each_entry_safe(ev, tmp, &mem->oom_notify, list) {
 		if (ev->eventfd == eventfd) {
@@ -4619,7 +4619,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 		}
 	}
 
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 }
 
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 0/2] memcg: oom locking updates
@ 2011-07-13 12:44 Michal Hocko
  2011-07-13 11:05 ` [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter Michal Hocko
  2011-07-13 12:32 ` [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock Michal Hocko
  0 siblings, 2 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-13 12:44 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-kernel

Hi,
this small series contains two patches. While the first one is a bug
fix, the other one is a cleanup which might be a bit controversial.

I have experienced serious starvation due to the way we currently handle
the oom_lock counter and the first patch aims at fixing it. The issue can
be reproduced quite easily on a machine with many CPUs and many tasks
fighting for memory (e.g. 100 tasks each allocating and touching 10MB of
anonymous memory in a tight loop within a 200MB group, with swapoff and
mem_control=0).
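
A sketch of that kind of reproducer (illustrative only; the task count,
allocation size and the cgroup setup are taken from the description
above and would be done separately, e.g. with cgcreate/cgset/cgexec):

==
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define NTASKS	100
#define CHUNK	(10 * 1024 * 1024)	/* 10MB of anonymous memory */

int main(void)
{
	int i;

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			for (;;) {	/* tight allocate-and-touch loop */
				char *c = malloc(CHUNK);

				if (c)
					memset(c, 0, CHUNK);
				free(c);
			}
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}
==

Run inside a small (e.g. 200MB) memory cgroup this keeps the group
permanently at its limit, so the tasks continuously hit the memcg OOM
path.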

I have no hard numbers to show that a spinlock is better than a mutex
for the second patch, but it feels more suitable for the code paths we
are using it in at the moment. It should also reduce the number of
context switches when there are many contenders.

Michal Hocko (2):
  memcg: make oom_lock 0 and 1 based rather than counter
  memcg: change memcg_oom_mutex to spinlock

 mm/memcontrol.c |   42 ++++++++++++++++++++++++++----------------
 1 files changed, 26 insertions(+), 16 deletions(-)

-- 
1.7.5.4


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter
  2011-07-13 11:05 ` [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter Michal Hocko
@ 2011-07-14  1:02     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14  1:02 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Wed, 13 Jul 2011 13:05:49 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> which is incremented by mem_cgroup_oom_lock when we are about to handle
> memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> oom_lock > 1 to prevent from multiple oom kills at the same time.
> The counter is then decremented by mem_cgroup_oom_unlock called from the
> same function.
> 
> This works correctly but it can lead to serious starvations when we
> have many processes triggering OOM.
> 
> Consider a process (call it A) which gets the oom_lock (the first one
> that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> other processes are blocked on the mutex.
> While A releases the mutex and calls mem_cgroup_out_of_memory others
> will wake up (one after another) and increase the counter and fall into
> sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
> takes the mutex again and decreases oom_lock and wakes other tasks (if
> releasing memory of the killed task hasn't done it yet).
> The main problem here is that everybody still race for the mutex and
> there is no guarantee that we will get counter back to 0 for those
> that got back to mem_cgroup_handle_oom. In the end the whole convoy
> in/decreases the counter but we do not get to 1 that would enable
> killing so nothing useful is going on.
> The time is basically unbounded because it highly depends on scheduling
> and ordering on mutex.
> 

Hmm, ok, I see the problem.


> This patch replaces the counter by a simple {un}lock semantic. We are
> using only 0 and 1 to distinguish those two states.
> As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> that we cannot race with somebody else which is already guaranteed
> because we call both functions with the mutex held. All other consumers
> just read the value atomically for a single group which is sufficient
> because we set the value atomically.
> The other thing is that only that process which locked the oom will
> unlock it once the OOM is handled.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   24 +++++++++++++++++-------
>  1 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..f6c9ead 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
> + * Has to be called with memcg_oom_mutex
>   */
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  {
> -	int x, lock_count = 0;
> +	int x, lock_count = -1;
>  	struct mem_cgroup *iter;
>  
>  	for_each_mem_cgroup_tree(iter, mem) {
> -		x = atomic_inc_return(&iter->oom_lock);
> -		lock_count = max(x, lock_count);
> +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> +		if (lock_count == -1)
> +			lock_count = x;
> +


Hmm... Assume the following hierarchy.

	  A
       B     C
      D E 

The original code handles this situation:

 1. B-D-E is under OOM
 2. A enters OOM after 1.

In the original code, A will not invoke OOM (because the B-D-E oom will
kill a process). With the new code, A will invoke a new OOM.... right?

I wonder about this kind of code:
==
	bool success = true;
	...
	for_each_mem_cgroup_tree(iter, mem) {
		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
		/* "break" loop is not allowed because of css refcount....*/
	}
	return success;
==
Then one hierarchy can invoke one OOM kill within it.
But this will not work because we can't do a proper unlock.


Hm, how about this? It has only one lock point and we'll not see the BUG.
Not tested yet...


---
 mm/memcontrol.c |   77 +++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 53 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..c36bd05 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -246,7 +246,8 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	atomic_t	oom_lock;
+	int		oom_lock;
+	atomic_t	under_oom;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -1801,36 +1802,63 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
- * Check OOM-Killer is already running under our hierarchy.
+ * Check whether OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
-static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
 {
-	int x, lock_count = 0;
 	struct mem_cgroup *iter;
+	bool ret;
 
-	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+	/*
+	 * If an ancestor (including this memcg) is the owner of OOM Lock.
+	 * return false;
+	 */
+	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+		if (iter->oom_lock)
+			break;
+		if (!iter->use_hierarchy) {
+			iter = NULL;
+			break;
+		}
 	}
 
-	if (lock_count == 1)
-		return true;
-	return false;
+	if (iter)
+		return false;
+	/*
+	 * Make the owner of OOM lock to be the highest ancestor of hierarchy
+	 * under OOM. IOW, move children's OOM owner information to this memcg
+	 * if a child is owner. In this case, an OOM killer is running and
+	 * we return false. But make this memcg as owner of oom-lock.
+	 */
+	ret = true;
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (iter->oom_lock) {
+			iter->oom_lock = 0;
+			ret = false;
+		}
+		atomic_set(&iter->under_oom, 1);
+	}
+	/* Make this memcg as the owner of OOM lock. */
+	memcg->oom_lock = 1;
+	return ret;
 }
 
-static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup *iter;
+	struct mem_cgroup *iter, *iter2;
 
-	/*
-	 * When a new child is created while the hierarchy is under oom,
-	 * mem_cgroup_oom_lock() may not be called. We have to use
-	 * atomic_add_unless() here.
-	 */
-	for_each_mem_cgroup_tree(iter, mem)
-		atomic_add_unless(&iter->oom_lock, -1, 0);
-	return 0;
+	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+		if (iter->oom_lock) {
+			iter->oom_lock = 0;
+			break;
+		}
+	}
+	BUG_ON(!iter);
+
+	for_each_mem_cgroup_tree(iter2, iter)
+		atomic_set(&iter->under_oom, 0);
+	return;
 }
 
 
@@ -1875,7 +1903,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
 
 static void memcg_oom_recover(struct mem_cgroup *mem)
 {
-	if (mem && atomic_read(&mem->oom_lock))
+	if (mem && atomic_read(&mem->under_oom))
 		memcg_wakeup_oom(mem);
 }
 
@@ -1916,7 +1944,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4584,7 +4613,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	list_add(&event->list, &memcg->oom_notify);
 
 	/* already in OOM ? */
-	if (atomic_read(&memcg->oom_lock))
+	if (atomic_read(&memcg->under_oom))
 		eventfd_signal(eventfd, 1);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4619,7 +4648,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 
 	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
 
-	if (atomic_read(&mem->oom_lock))
+	if (atomic_read(&mem->under_oom))
 		cb->fill(cb, "under_oom", 1);
 	else
 		cb->fill(cb, "under_oom", 0);
-- 
1.7.4.1











^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter
  2011-07-14  1:02     ` KAMEZAWA Hiroyuki
@ 2011-07-14  2:59       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14  2:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu, 14 Jul 2011 10:02:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 13 Jul 2011 13:05:49 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> > which is incremented by mem_cgroup_oom_lock when we are about to handle
> > memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> > oom_lock > 1 to prevent from multiple oom kills at the same time.
> > The counter is then decremented by mem_cgroup_oom_unlock called from the
> > same function.
> > 
> > This works correctly but it can lead to serious starvations when we
> > have many processes triggering OOM.
> > 
> > Consider a process (call it A) which gets the oom_lock (the first one
> > that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> > other processes are blocked on the mutex.
> > While A releases the mutex and calls mem_cgroup_out_of_memory others
> > will wake up (one after another) and increase the counter and fall into
> > sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
> > takes the mutex again and decreases oom_lock and wakes other tasks (if
> > releasing memory of the killed task hasn't done it yet).
> > The main problem here is that everybody still race for the mutex and
> > there is no guarantee that we will get counter back to 0 for those
> > that got back to mem_cgroup_handle_oom. In the end the whole convoy
> > in/decreases the counter but we do not get to 1 that would enable
> > killing so nothing useful is going on.
> > The time is basically unbounded because it highly depends on scheduling
> > and ordering on mutex.
> > 
> 
> Hmm, ok, I see the problem.
> 
> 
> > This patch replaces the counter by a simple {un}lock semantic. We are
> > using only 0 and 1 to distinguish those two states.
> > As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> > that we cannot race with somebody else which is already guaranteed
> > because we call both functions with the mutex held. All other consumers
> > just read the value atomically for a single group which is sufficient
> > because we set the value atomically.
> > The other thing is that only that process which locked the oom will
> > unlock it once the OOM is handled.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > ---
> >  mm/memcontrol.c |   24 +++++++++++++++++-------
> >  1 files changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e013b8e..f6c9ead 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> >  /*
> >   * Check OOM-Killer is already running under our hierarchy.
> >   * If someone is running, return false.
> > + * Has to be called with memcg_oom_mutex
> >   */
> >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> >  {
> > -	int x, lock_count = 0;
> > +	int x, lock_count = -1;
> >  	struct mem_cgroup *iter;
> >  
> >  	for_each_mem_cgroup_tree(iter, mem) {
> > -		x = atomic_inc_return(&iter->oom_lock);
> > -		lock_count = max(x, lock_count);
> > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > +		if (lock_count == -1)
> > +			lock_count = x;
> > +
> 
> 
> Hmm...Assume following hierarchy.
> 
> 	  A
>        B     C
>       D E 
> 
> The orignal code hanldes the situation
> 
>  1. B-D-E is under OOM
>  2. A enters OOM after 1.
> 
> In original code, A will not invoke OOM (because B-D-E oom will kill a process.)
> The new code invokes A will invoke new OOM....right ?
> 
> I wonder this kind of code
> ==
> 	bool success = true;
> 	...
> 	for_each_mem_cgroup_tree(iter, mem) {
> 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> 		/* "break" loop is not allowed because of css refcount....*/
> 	}
> 	return success.
> ==
> Then, one hierarchy can invoke one OOM kill within it.
> But this will not work because we can't do proper unlock.
> 
> 
> Hm. how about this ? This has only one lock point and we'll not see the BUG.
> Not tested yet..
> 
Here is the tested patch + test program. This seems to work well.
==
From 8c878b3413b4d796844dbcb18fa7cfccf44860d7 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 14 Jul 2011 11:36:50 +0900
Subject: [PATCH] memcg: fix livelock at oom.

867578cb "memcg: fix oom kill behavior" introduced the oom_lock counter
which is incremented by mem_cgroup_oom_lock when we are about to handle
a memcg OOM situation. mem_cgroup_handle_oom falls back to sleeping if
oom_lock > 1 to prevent multiple oom kills from running at the same time.
The counter is then decremented by mem_cgroup_oom_unlock, called from the
same function.

This works correctly but it can lead to serious starvation when we
have many processes triggering OOM.

Example:

Make a hierarchy of memcg, which has 300MB memory+swap limit.

 %cgcreate -g memory:A
 %cgset -r memory.use_hierarchy=1 A
 %cgset -r memory.limit_in_bytes=300M A
 %cgset -r memory.memsw.limit_in_bytes=300M A
 %cgcreate -g memory:A/B
 %cgcreate -g memory:A/C
 %cgcreate -g memory:A/B/X
 %cgcreate -g memory:A/B/Y

Then, run the following program under A/B/X:
 %cgexec -g memory:A/B/X ./fork
==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
        int i;
        int status;

        for (i = 0; i < 5000; i++) {
                if (fork() == 0) {
                        /* each child touches 1MB and lingers for a while */
                        char *c;
                        c = malloc(1024*1024);
                        memset(c, 0, 1024*1024);
                        sleep(20);
                        fprintf(stderr, "[%d]\n", i);
                        exit(0);
                }
                printf("%d\n", i);
                waitpid(-1, &status, WNOHANG);
        }
        /* reap the remaining children */
        while (1) {
                int ret;
                ret = waitpid(-1, &status, WNOHANG);

                if (ret == -1)
                        break;
                if (!ret)
                        sleep(1);
        }
        return 0;
}
==

This forks processes whose children each malloc 1MB. After forking about
300 children the memcg goes into OOM. The expected behavior is that the
oom-killer kills a process and the workload makes progress. But the 300
children all try to take the oom_lock and oom kill... and the program
seems not to make progress.

The reason is that memcg invokes the OOM killer only when the counter
oom_lock is 0. But when many processes are running it never goes down
to 0, because it is incremented again before all the processes woken up
by the previous oom-lock holder (which decrements oom_lock) have left
their sleep.

This patch fixes the behavior by making the oom hierarchy hold only one
lock value, 1/0. For example, in the following hierarchy,

	A
       /
      B
     / \
    C   D

When C goes into OOM because of the limit on B, set B->oom_lock=1.
After that, when A goes into OOM because of the limit on A,
clear B->oom_lock to 0 and set A->oom_lock=1.

At unlock time, the ancestor which has ->oom_lock=1 is cleared.

After this change, the above program forks 5000 processes, with 4000+
oom-killer invocations along the way.

Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Changelog:
  - fixed under_oom counter reset.
---
 mm/memcontrol.c |   77 +++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 53 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..5f9661b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -246,7 +246,8 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	atomic_t	oom_lock;
+	int		oom_lock;
+	atomic_t	under_oom;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -1801,36 +1802,63 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 }
 
 /*
- * Check OOM-Killer is already running under our hierarchy.
+ * Check whether OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
-static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
 {
-	int x, lock_count = 0;
 	struct mem_cgroup *iter;
+	bool ret;
 
-	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+	/*
+	 * If an ancestor (including this memcg) is the owner of OOM Lock.
+	 * return false;
+	 */
+	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+		if (iter->oom_lock)
+			break;
+		if (!iter->use_hierarchy) {
+			iter = NULL;
+			break;
+		}
 	}
 
-	if (lock_count == 1)
-		return true;
-	return false;
+	if (iter)
+		return false;
+	/*
+	 * Make the owner of OOM lock to be the highest ancestor of hierarchy
+	 * under OOM. IOW, move children's OOM owner information to this memcg
+	 * if a child is owner. In this case, an OOM killer is running and
+	 * we return false. But make this memcg as owner of oom-lock.
+	 */
+	ret = true;
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (iter->oom_lock) {
+			iter->oom_lock = 0;
+			ret = false;
+		}
+		atomic_set(&iter->under_oom, 1);
+	}
+	/* Make this memcg as the owner of OOM lock. */
+	memcg->oom_lock = 1;
+	return ret;
 }
 
-static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup *iter;
+	struct mem_cgroup *iter, *iter2;
 
-	/*
-	 * When a new child is created while the hierarchy is under oom,
-	 * mem_cgroup_oom_lock() may not be called. We have to use
-	 * atomic_add_unless() here.
-	 */
-	for_each_mem_cgroup_tree(iter, mem)
-		atomic_add_unless(&iter->oom_lock, -1, 0);
-	return 0;
+	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
+		if (iter->oom_lock) {
+			iter->oom_lock = 0;
+			break;
+		}
+	}
+	BUG_ON(!iter);
+
+	for_each_mem_cgroup_tree(iter2, iter)
+		atomic_set(&iter2->under_oom, 0);
+	return;
 }
 
 
@@ -1875,7 +1903,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
 
 static void memcg_oom_recover(struct mem_cgroup *mem)
 {
-	if (mem && atomic_read(&mem->oom_lock))
+	if (mem && atomic_read(&mem->under_oom))
 		memcg_wakeup_oom(mem);
 }
 
@@ -1916,7 +1944,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4584,7 +4613,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	list_add(&event->list, &memcg->oom_notify);
 
 	/* already in OOM ? */
-	if (atomic_read(&memcg->oom_lock))
+	if (atomic_read(&memcg->under_oom))
 		eventfd_signal(eventfd, 1);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4619,7 +4648,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 
 	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
 
-	if (atomic_read(&mem->oom_lock))
+	if (atomic_read(&mem->under_oom))
 		cb->fill(cb, "under_oom", 1);
 	else
 		cb->fill(cb, "under_oom", 0);
-- 
1.7.4.1




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter
  2011-07-14  2:59       ` KAMEZAWA Hiroyuki
@ 2011-07-14  9:00         ` Michal Hocko
  -1 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14  9:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 10:02:59 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 13 Jul 2011 13:05:49 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
[...]
> > > This patch replaces the counter by a simple {un}lock semantic. We are
> > > using only 0 and 1 to distinguish those two states.
> > > As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> > > that we cannot race with somebody else which is already guaranteed
> > > because we call both functions with the mutex held. All other consumers
> > > just read the value atomically for a single group which is sufficient
> > > because we set the value atomically.
> > > The other thing is that only that process which locked the oom will
> > > unlock it once the OOM is handled.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > > ---
> > >  mm/memcontrol.c |   24 +++++++++++++++++-------
> > >  1 files changed, 17 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index e013b8e..f6c9ead 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > >  /*
> > >   * Check OOM-Killer is already running under our hierarchy.
> > >   * If someone is running, return false.
> > > + * Has to be called with memcg_oom_mutex
> > >   */
> > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > >  {
> > > -	int x, lock_count = 0;
> > > +	int x, lock_count = -1;
> > >  	struct mem_cgroup *iter;
> > >  
> > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > -		x = atomic_inc_return(&iter->oom_lock);
> > > -		lock_count = max(x, lock_count);
> > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > +		if (lock_count == -1)
> > > +			lock_count = x;
> > > +
> > 
> > 
> > Hmm...Assume following hierarchy.
> > 
> > 	  A
> >        B     C
> >       D E 

IIUC, A, B, D, E are one hierarchy, right?

> > 
> > The orignal code hanldes the situation
> > 
> >  1. B-D-E is under OOM
> >  2. A enters OOM after 1.
> > 
> > In original code, A will not invoke OOM (because B-D-E oom will kill a process.)
> > The new code invokes A will invoke new OOM....right ?

Sorry, I do not understand what you mean by that. The original code and
the new code do the same thing in that regard: they lock the whole
hierarchy. The only difference is that the original one increments the
counter for all groups in the hierarchy while the new one just flips it
from 0 to 1. The BUG_ON just checks that we are not racing with somebody
else.

> > 
> > I wonder this kind of code
> > ==
> > 	bool success = true;
> > 	...
> > 	for_each_mem_cgroup_tree(iter, mem) {
> > 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > 		/* "break" loop is not allowed because of css refcount....*/
> > 	}
> > 	return success.
> > ==
> > Then, one hierarchy can invoke one OOM kill within it.
> > But this will not work because we can't do proper unlock.

Why can't we do a proper unlock?

> > 
> > 
> > Hm. how about this ? This has only one lock point and we'll not see the BUG.
> > Not tested yet..
> > 
> Here, tested patch + test program. this seems to work well.

I will look at it later. At first glance it looks rather complicated,
but maybe I am missing something. I have to confess I am not absolutely
sure when it comes to hierarchies.

> ==
> From 8c878b3413b4d796844dbcb18fa7cfccf44860d7 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 14 Jul 2011 11:36:50 +0900
> Subject: [PATCH] memcg: fix livelock at oom.
> 
> 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> which is incremented by mem_cgroup_oom_lock when we are about to handle
> memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> oom_lock > 1 to prevent from multiple oom kills at the same time.
> The counter is then decremented by mem_cgroup_oom_unlock called from the
> same function.
> 
> This works correctly but it can lead to serious starvations when we
> have many processes triggering OOM.
> 
> example.
> 
> Make a hierarchy of memcg, which has 300MB memory+swap limit.
> 
>  %cgcreate -g memory:A
>  %cgset -r memory.use_hierarchy=1 A
>  %cgset -r memory.limit_in_bytes=300M A
>  %cgset -r memory.memsw.limit_in_bytes=300M A
>  %cgcreate -g memory:A/B
>  %cgcreate -g memory:A/C
>  %cgcreate -g memory:A/B/X
>  %cgcreate -g memory:A/B/Y
> 
> Then, running folloing program under A/B/X.
>  %cgexec -g memory:A/B/X ./fork
> ==
> int main(int argc, char *argv[])
> {
>         int i;
>         int status;
> 
>         for (i = 0; i < 5000; i++) {
>                 if (fork() == 0) {
>                         char *c;
>                         c = malloc(1024*1024);
>                         memset(c, 0, 1024*1024);
>                         sleep(20);
>                         fprintf(stderr, "[%d]\n", i);
>                         exit(0);
>                 }
>                 printf("%d\n", i);
>                 waitpid(-1, &status, WNOHANG);
>         }
>         while (1) {
>                 int ret;
>                 ret = waitpid(-1, &status, WNOHANG);
> 
>                 if (ret == -1)
>                         break;
>                 if (!ret)
>                         sleep(1);
>         }
>         return 0;
> }
> ==
> 
> This forks a process and the child malloc(1M). Then, after forking 300
> childrens, the memcg goes int OOM. Expected behavior is oom-killer
> will kill process and make progress. But, 300 children will try to get
> oom_lock and oom kill...and the program seems not to make progress.
> 
> The reason is that memcg invokes OOM-Kill when the counter oom_lock is 0.
> But if many process runs, it never goes down to 0 because it's incremanted
> before all processes quits sleep by previous oom-lock, which decremetns
> oom_lock.
> 
> This patch fixes the behavior. This patch makes the oom-hierarchy should
> have only one lock value 1/0. For example, in following hierarchy,
> 
> 	A
>        /
>       B
>      / \
>     C   D
> 
> When C goes into OOM because of limit by B, set B->oom_lock=1
> After that, when A goes into OOM because of limit by A,
> clear B->oom_lock as 0 and set A->oom_lock=1.
> 
> At unlocking, the ancestor which has ()->oom_lock=1 will be cleared.
> 
> After this, above program will do fork 5000 procs with 4000+ oom-killer.
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Changelog:
>   - fixed under_oom counter reset.
> ---
>  mm/memcontrol.c |   77 +++++++++++++++++++++++++++++++++++++-----------------
>  1 files changed, 53 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..5f9661b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -246,7 +246,8 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> -	atomic_t	oom_lock;
> +	int		oom_lock;
> +	atomic_t	under_oom;
>  	atomic_t	refcnt;
>  
>  	unsigned int	swappiness;
> @@ -1801,36 +1802,63 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  }
>  
>  /*
> - * Check OOM-Killer is already running under our hierarchy.
> + * Check whether OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
>   */
> -static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> +static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
>  {
> -	int x, lock_count = 0;
>  	struct mem_cgroup *iter;
> +	bool ret;
>  
> -	for_each_mem_cgroup_tree(iter, mem) {
> -		x = atomic_inc_return(&iter->oom_lock);
> -		lock_count = max(x, lock_count);
> +	/*
> +	 * If an ancestor (including this memcg) is the owner of OOM Lock.
> +	 * return false;
> +	 */
> +	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
> +		if (iter->oom_lock)
> +			break;
> +		if (!iter->use_hierarchy) {
> +			iter = NULL;
> +			break;
> +		}
>  	}
>  
> -	if (lock_count == 1)
> -		return true;
> -	return false;
> +	if (iter)
> +		return false;
> +	/*
> +	 * Make the owner of OOM lock to be the highest ancestor of hierarchy
> +	 * under OOM. IOW, move children's OOM owner information to this memcg
> +	 * if a child is owner. In this case, an OOM killer is running and
> +	 * we return false. But make this memcg as owner of oom-lock.
> +	 */
> +	ret = true;
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		if (iter->oom_lock) {
> +			iter->oom_lock = 0;
> +			ret = false;
> +		}
> +		atomic_set(&iter->under_oom, 1);
> +	}
> +	/* Make this memcg as the owner of OOM lock. */
> +	memcg->oom_lock = 1;
> +	return ret;
>  }
>  
> -static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
> +static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
>  {
> -	struct mem_cgroup *iter;
> +	struct mem_cgroup *iter, *iter2;
>  
> -	/*
> -	 * When a new child is created while the hierarchy is under oom,
> -	 * mem_cgroup_oom_lock() may not be called. We have to use
> -	 * atomic_add_unless() here.
> -	 */
> -	for_each_mem_cgroup_tree(iter, mem)
> -		atomic_add_unless(&iter->oom_lock, -1, 0);
> -	return 0;
> +	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
> +		if (iter->oom_lock) {
> +			iter->oom_lock = 0;
> +			break;
> +		}
> +	}
> +	BUG_ON(!iter);
> +
> +	for_each_mem_cgroup_tree(iter2, iter)
> +		atomic_set(&iter2->under_oom, 0);
> +	return;
>  }
>  
>  
> @@ -1875,7 +1903,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
>  
>  static void memcg_oom_recover(struct mem_cgroup *mem)
>  {
> -	if (mem && atomic_read(&mem->oom_lock))
> +	if (mem && atomic_read(&mem->under_oom))
>  		memcg_wakeup_oom(mem);
>  }
>  
> @@ -1916,7 +1944,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  	}
>  	mutex_lock(&memcg_oom_mutex);
> -	mem_cgroup_oom_unlock(mem);
> +	if (locked)
> +		mem_cgroup_oom_unlock(mem);
>  	memcg_wakeup_oom(mem);
>  	mutex_unlock(&memcg_oom_mutex);
>  
> @@ -4584,7 +4613,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
>  	list_add(&event->list, &memcg->oom_notify);
>  
>  	/* already in OOM ? */
> -	if (atomic_read(&memcg->oom_lock))
> +	if (atomic_read(&memcg->under_oom))
>  		eventfd_signal(eventfd, 1);
>  	mutex_unlock(&memcg_oom_mutex);
>  
> @@ -4619,7 +4648,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>  
>  	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
>  
> -	if (atomic_read(&mem->oom_lock))
> +	if (atomic_read(&mem->under_oom))
>  		cb->fill(cb, "under_oom", 1);
>  	else
>  		cb->fill(cb, "under_oom", 0);
> -- 
> 1.7.4.1
> 
> 
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-07-14  9:00         ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14  9:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 10:02:59 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 13 Jul 2011 13:05:49 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
[...]
> > > This patch replaces the counter by a simple {un}lock semantic. We are
> > > using only 0 and 1 to distinguish those two states.
> > > As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> > > that we cannot race with somebody else which is already guaranteed
> > > because we call both functions with the mutex held. All other consumers
> > > just read the value atomically for a single group which is sufficient
> > > because we set the value atomically.
> > > The other thing is that only that process which locked the oom will
> > > unlock it once the OOM is handled.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > > ---
> > >  mm/memcontrol.c |   24 +++++++++++++++++-------
> > >  1 files changed, 17 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index e013b8e..f6c9ead 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > >  /*
> > >   * Check OOM-Killer is already running under our hierarchy.
> > >   * If someone is running, return false.
> > > + * Has to be called with memcg_oom_mutex
> > >   */
> > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > >  {
> > > -	int x, lock_count = 0;
> > > +	int x, lock_count = -1;
> > >  	struct mem_cgroup *iter;
> > >  
> > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > -		x = atomic_inc_return(&iter->oom_lock);
> > > -		lock_count = max(x, lock_count);
> > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > +		if (lock_count == -1)
> > > +			lock_count = x;
> > > +
> > 
> > 
> > Hmm...Assume following hierarchy.
> > 
> > 	  A
> >        B     C
> >       D E 

IIUC, A, B, D, E are one hierarchy, right?

> > 
> > The orignal code hanldes the situation
> > 
> >  1. B-D-E is under OOM
> >  2. A enters OOM after 1.
> > 
> > In original code, A will not invoke OOM (because B-D-E oom will kill a process.)
> > The new code invokes A will invoke new OOM....right ?

Sorry, I do not understand what you mean by that. The original code and
the new code do the same in that regards they lock the whole hierarchy.
The only difference is that the original one increments the counter for
all groups in the hierarchy while the new one just sets it to from 0->1
BUG_ON just checks that we are not racing with somebody else.

> > 
> > I wonder this kind of code
> > ==
> > 	bool success = true;
> > 	...
> > 	for_each_mem_cgroup_tree(iter, mem) {
> > 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > 		/* "break" loop is not allowed because of css refcount....*/
> > 	}
> > 	return success.
> > ==
> > Then, one hierarchy can invoke one OOM kill within it.
> > But this will not work because we can't do proper unlock.

Why cannot we do a proper unlock?

> > 
> > 
> > Hm. how about this ? This has only one lock point and we'll not see the BUG.
> > Not tested yet..
> > 
> Here, tested patch + test program. this seems to work well.

Will look at it later. At first glance it looks rather complicated. But
maybe I am missing something. I have to confess I am not absolutely sure
when it comes to hierarchies.

> ==
> From 8c878b3413b4d796844dbcb18fa7cfccf44860d7 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 14 Jul 2011 11:36:50 +0900
> Subject: [PATCH] memcg: fix livelock at oom.
> 
> 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> which is incremented by mem_cgroup_oom_lock when we are about to handle
> memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> oom_lock > 1 to prevent from multiple oom kills at the same time.
> The counter is then decremented by mem_cgroup_oom_unlock called from the
> same function.
> 
> This works correctly but it can lead to serious starvations when we
> have many processes triggering OOM.
> 
> example.
> 
> Make a hierarchy of memcg, which has 300MB memory+swap limit.
> 
>  %cgcreate -g memory:A
>  %cgset -r memory.use_hierarchy=1 A
>  %cgset -r memory.limit_in_bytes=300M A
>  %cgset -r memory.memsw.limit_in_bytes=300M A
>  %cgcreate -g memory:A/B
>  %cgcreate -g memory:A/C
>  %cgcreate -g memory:A/B/X
>  %cgcreate -g memory:A/B/Y
> 
> Then, running folloing program under A/B/X.
>  %cgexec -g memory:A/B/X ./fork
> ==
> int main(int argc, char *argv[])
> {
>         int i;
>         int status;
> 
>         for (i = 0; i < 5000; i++) {
>                 if (fork() == 0) {
>                         char *c;
>                         c = malloc(1024*1024);
>                         memset(c, 0, 1024*1024);
>                         sleep(20);
>                         fprintf(stderr, "[%d]\n", i);
>                         exit(0);
>                 }
>                 printf("%d\n", i);
>                 waitpid(-1, &status, WNOHANG);
>         }
>         while (1) {
>                 int ret;
>                 ret = waitpid(-1, &status, WNOHANG);
> 
>                 if (ret == -1)
>                         break;
>                 if (!ret)
>                         sleep(1);
>         }
>         return 0;
> }
> ==
> 
> This forks a process and the child malloc(1M). Then, after forking 300
> childrens, the memcg goes int OOM. Expected behavior is oom-killer
> will kill process and make progress. But, 300 children will try to get
> oom_lock and oom kill...and the program seems not to make progress.
> 
> The reason is that memcg invokes OOM-Kill when the counter oom_lock is 0.
> But if many process runs, it never goes down to 0 because it's incremanted
> before all processes quits sleep by previous oom-lock, which decremetns
> oom_lock.
> 
> This patch fixes the behavior. This patch makes the oom-hierarchy should
> have only one lock value 1/0. For example, in following hierarchy,
> 
> 	A
>        /
>       B
>      / \
>     C   D
> 
> When C goes into OOM because of limit by B, set B->oom_lock=1
> After that, when A goes into OOM because of limit by A,
> clear B->oom_lock as 0 and set A->oom_lock=1.
> 
> At unlocking, the ancestor which has ()->oom_lock=1 will be cleared.
> 
> After this, above program will do fork 5000 procs with 4000+ oom-killer.
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Changelog:
>   - fixed under_oom counter reset.
> ---
>  mm/memcontrol.c |   77 +++++++++++++++++++++++++++++++++++++-----------------
>  1 files changed, 53 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..5f9661b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -246,7 +246,8 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> -	atomic_t	oom_lock;
> +	int		oom_lock;
> +	atomic_t	under_oom;
>  	atomic_t	refcnt;
>  
>  	unsigned int	swappiness;
> @@ -1801,36 +1802,63 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  }
>  
>  /*
> - * Check OOM-Killer is already running under our hierarchy.
> + * Check whether OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
>   */
> -static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> +static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
>  {
> -	int x, lock_count = 0;
>  	struct mem_cgroup *iter;
> +	bool ret;
>  
> -	for_each_mem_cgroup_tree(iter, mem) {
> -		x = atomic_inc_return(&iter->oom_lock);
> -		lock_count = max(x, lock_count);
> +	/*
> +	 * If an ancestor (including this memcg) is the owner of OOM Lock.
> +	 * return false;
> +	 */
> +	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
> +		if (iter->oom_lock)
> +			break;
> +		if (!iter->use_hierarchy) {
> +			iter = NULL;
> +			break;
> +		}
>  	}
>  
> -	if (lock_count == 1)
> -		return true;
> -	return false;
> +	if (iter)
> +		return false;
> +	/*
> +	 * Make the owner of OOM lock to be the highest ancestor of hierarchy
> +	 * under OOM. IOW, move children's OOM owner information to this memcg
> +	 * if a child is owner. In this case, an OOM killer is running and
> +	 * we return false. But make this memcg as owner of oom-lock.
> +	 */
> +	ret = true;
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		if (iter->oom_lock) {
> +			iter->oom_lock = 0;
> +			ret = false;
> +		}
> +		atomic_set(&iter->under_oom, 1);
> +	}
> +	/* Make this memcg as the owner of OOM lock. */
> +	memcg->oom_lock = 1;
> +	return ret;
>  }
>  
> -static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
> +static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
>  {
> -	struct mem_cgroup *iter;
> +	struct mem_cgroup *iter, *iter2;
>  
> -	/*
> -	 * When a new child is created while the hierarchy is under oom,
> -	 * mem_cgroup_oom_lock() may not be called. We have to use
> -	 * atomic_add_unless() here.
> -	 */
> -	for_each_mem_cgroup_tree(iter, mem)
> -		atomic_add_unless(&iter->oom_lock, -1, 0);
> -	return 0;
> +	for (iter = memcg; iter != NULL; iter = parent_mem_cgroup(iter)) {
> +		if (iter->oom_lock) {
> +			iter->oom_lock = 0;
> +			break;
> +		}
> +	}
> +	BUG_ON(!iter);
> +
> +	for_each_mem_cgroup_tree(iter2, iter)
> +		atomic_set(&iter2->under_oom, 0);
> +	return;
>  }
>  
>  
> @@ -1875,7 +1903,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
>  
>  static void memcg_oom_recover(struct mem_cgroup *mem)
>  {
> -	if (mem && atomic_read(&mem->oom_lock))
> +	if (mem && atomic_read(&mem->under_oom))
>  		memcg_wakeup_oom(mem);
>  }
>  
> @@ -1916,7 +1944,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  	}
>  	mutex_lock(&memcg_oom_mutex);
> -	mem_cgroup_oom_unlock(mem);
> +	if (locked)
> +		mem_cgroup_oom_unlock(mem);
>  	memcg_wakeup_oom(mem);
>  	mutex_unlock(&memcg_oom_mutex);
>  
> @@ -4584,7 +4613,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
>  	list_add(&event->list, &memcg->oom_notify);
>  
>  	/* already in OOM ? */
> -	if (atomic_read(&memcg->oom_lock))
> +	if (atomic_read(&memcg->under_oom))
>  		eventfd_signal(eventfd, 1);
>  	mutex_unlock(&memcg_oom_mutex);
>  
> @@ -4619,7 +4648,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>  
>  	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
>  
> -	if (atomic_read(&mem->oom_lock))
> +	if (atomic_read(&mem->under_oom))
>  		cb->fill(cb, "under_oom", 1);
>  	else
>  		cb->fill(cb, "under_oom", 0);
> -- 
> 1.7.4.1
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14  9:00         ` Michal Hocko
@ 2011-07-14  9:30           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14  9:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu, 14 Jul 2011 11:00:17 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 10:02:59 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Wed, 13 Jul 2011 13:05:49 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > > > This patch replaces the counter by a simple {un}lock semantic. We are
> > > > using only 0 and 1 to distinguish those two states.
> > > > As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> > > > that we cannot race with somebody else which is already guaranteed
> > > > because we call both functions with the mutex held. All other consumers
> > > > just read the value atomically for a single group which is sufficient
> > > > because we set the value atomically.
> > > > The other thing is that only that process which locked the oom will
> > > > unlock it once the OOM is handled.
> > > > 
> > > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > > > ---
> > > >  mm/memcontrol.c |   24 +++++++++++++++++-------
> > > >  1 files changed, 17 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index e013b8e..f6c9ead 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > > >  /*
> > > >   * Check OOM-Killer is already running under our hierarchy.
> > > >   * If someone is running, return false.
> > > > + * Has to be called with memcg_oom_mutex
> > > >   */
> > > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > > >  {
> > > > -	int x, lock_count = 0;
> > > > +	int x, lock_count = -1;
> > > >  	struct mem_cgroup *iter;
> > > >  
> > > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > > -		x = atomic_inc_return(&iter->oom_lock);
> > > > -		lock_count = max(x, lock_count);
> > > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > > +		if (lock_count == -1)
> > > > +			lock_count = x;
> > > > +
> > > 
> > > 
> > > Hmm...Assume following hierarchy.
> > > 
> > > 	  A
> > >        B     C
> > >       D E 
> 
> IIUC, A, B, D, E are one hierarchy, right?
> 
yes.


> > > 
> > > The orignal code hanldes the situation
> > > 
> > >  1. B-D-E is under OOM
> > >  2. A enters OOM after 1.
> > > 
> > > In original code, A will not invoke OOM (because B-D-E oom will kill a process.)
> > > The new code invokes A will invoke new OOM....right ?
> 
> Sorry, I do not understand what you mean by that. 

This is your code.
==
 	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
+		if (lock_count == -1)
+			lock_count = x;
+
+		/* New child can be created but we shouldn't race with
+		 * somebody else trying to oom because we are under
+		 * memcg_oom_mutex
+		 */
+		BUG_ON(lock_count != x);
 	}
==

When B, D and E are under OOM,

   A oom_lock = 0
   B oom_lock = 1
   C oom_lock = 0
   D oom_lock = 1
   E oom_lock = 1

Here, assume A enters OOM.

   A oom_lock = 1 -- (*)
   B oom_lock = 1
   C oom_lock = 1
   D oom_lock = 1
   E oom_lock = 1

because of (*), mem_cgroup_oom_lock() will see lock_count=1 and return true.

Then, a new oom-killer will run in parallel with another oom-killer which is
already running in B-D-E.




> The original code and
> the new code do the same in that regards they lock the whole hierarchy.
> The only difference is that the original one increments the counter for
> all groups in the hierarchy while the new one just sets it to from 0->1
> BUG_ON just checks that we are not racing with somebody else.
> 


In the above situation, the old code's result is

   A oom_lock = 1
   B oom_lock = 2
   C oom_lock = 1
   D oom_lock = 2
   E oom_lock = 2

Then, max lock_count == 2 and the return value is false.



> > > 
> > > I wonder this kind of code
> > > ==
> > > 	bool success = true;
> > > 	...
> > > 	for_each_mem_cgroup_tree(iter, mem) {
> > > 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > 		/* "break" loop is not allowed because of css refcount....*/
> > > 	}
> > > 	return success.
> > > ==
> > > Then, one hierarchy can invoke one OOM kill within it.
> > > But this will not work because we can't do proper unlock.
> 
> Why cannot we do a proper unlock?
> 

After 1st lock by oom in B-D-E

   A oom_lock = 0
   B oom_lock = 1
   C oom_lock = 0
   D oom_lock = 1
   E oom_lock = 1

returns true, and mem_cgroup_oom_unlock() will scan B-D-E and set oom_lock = 0.

2nd lock by oom in A-B-C-D-E

  A oom_lock = 1
  B oom_lock = 1
  C oom_lock = 1
  D oom_lock = 1
  E oom_lock = 1

returns false, so mem_cgroup_oom_unlock() will not be called.

After unlock, we see

  A oom_lock = 1
  B oom_lock = 0
  C oom_lock = 1
  D oom_lock = 0
  E oom_lock = 0


> > > 
> > > 
> > > Hm. how about this ? This has only one lock point and we'll not see the BUG.
> > > Not tested yet..
> > > 
> > Here, tested patch + test program. this seems to work well.
> 
> Will look at it later. At first glance it looks rather complicated. But
> maybe I am missing something. I have to confess I am not absolutely sure
> when it comes to hierarchies.
> 


With my patch, oom_lock marks the lock-owner memcg. The OOM status is moved
to a separate variable, under_oom.

(1) At the 1st visit by OOM in B-D-E

  A oom_lock = 0  under_oom=0
  B oom_lock = 1  under_oom=1
  C oom_lock = 0  under_oom=0
  D oom_lock = 0  under_oom=1
  E oom_lock = 0  under_oom=1

returns true.

(2) At the 2nd visit by OOM in A-B-C-D-E

  A oom_lock = 1  under_oom=1
  B oom_lock = 0  under_oom=1
  C oom_lock = 0  under_oom=1
  D oom_lock = 0  under_oom=1
  E oom_lock = 0  under_oom=1

The lock moves to A and returns false.

(3) At unlock, the thread which did (1) will find group A and
    unlock the whole hierarchy.


  A oom_lock = 0  under_oom=0
  B oom_lock = 0  under_oom=0
  C oom_lock = 0  under_oom=0
  D oom_lock = 0  under_oom=0
  E oom_lock = 0  under_oom=0


Thanks,
-Kame









^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14  9:30           ` KAMEZAWA Hiroyuki
@ 2011-07-14  9:51             ` Michal Hocko
  -1 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14  9:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 18:30:14, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 11:00:17 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 14 Jul 2011 10:02:59 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Wed, 13 Jul 2011 13:05:49 +0200
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > [...]
> > > > > This patch replaces the counter by a simple {un}lock semantic. We are
> > > > > using only 0 and 1 to distinguish those two states.
> > > > > As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> > > > > that we cannot race with somebody else which is already guaranteed
> > > > > because we call both functions with the mutex held. All other consumers
> > > > > just read the value atomically for a single group which is sufficient
> > > > > because we set the value atomically.
> > > > > The other thing is that only that process which locked the oom will
> > > > > unlock it once the OOM is handled.
> > > > > 
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > > > > ---
> > > > >  mm/memcontrol.c |   24 +++++++++++++++++-------
> > > > >  1 files changed, 17 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index e013b8e..f6c9ead 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > > > >  /*
> > > > >   * Check OOM-Killer is already running under our hierarchy.
> > > > >   * If someone is running, return false.
> > > > > + * Has to be called with memcg_oom_mutex
> > > > >   */
> > > > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > > > >  {
> > > > > -	int x, lock_count = 0;
> > > > > +	int x, lock_count = -1;
> > > > >  	struct mem_cgroup *iter;
> > > > >  
> > > > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > > > -		x = atomic_inc_return(&iter->oom_lock);
> > > > > -		lock_count = max(x, lock_count);
> > > > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > > > +		if (lock_count == -1)
> > > > > +			lock_count = x;
> > > > > +
> > > > 
> > > > 
> > > > Hmm...Assume following hierarchy.
> > > > 
> > > > 	  A
> > > >        B     C
> > > >       D E 
> > 
> > IIUC, A, B, D, E are one hierarchy, right?
> > 
> yes.
> 
> 
> > > > 
> > > > The orignal code hanldes the situation
> > > > 
> > > >  1. B-D-E is under OOM
> > > >  2. A enters OOM after 1.
> > > > 
> > > > In original code, A will not invoke OOM (because B-D-E oom will kill a process.)
> > > > The new code invokes A will invoke new OOM....right ?
> > 
> > Sorry, I do not understand what you mean by that. 
> 
> This is your code.
> ==
>  	for_each_mem_cgroup_tree(iter, mem) {
> -		x = atomic_inc_return(&iter->oom_lock);
> -		lock_count = max(x, lock_count);
> +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> +		if (lock_count == -1)
> +			lock_count = x;
> +
> +		/* New child can be created but we shouldn't race with
> +		 * somebody else trying to oom because we are under
> +		 * memcg_oom_mutex
> +		 */
> +		BUG_ON(lock_count != x);
>  	}
> ==
> 
> When, B,D,E is under OOM,  
> 
>    A oom_lock = 0
>    B oom_lock = 1
>    C oom_lock = 0
>    D oom_lock = 1
>    E oom_lock = 1
> 
> Here, assume A enters OOM.
> 
>    A oom_lock = 1 -- (*)
>    B oom_lock = 1
>    C oom_lock = 1
>    D oom_lock = 1
>    E oom_lock = 1
> 
> because of (*), mem_cgroup_oom_lock() will return lock_count=1, true.
> 
> Then, a new oom-killer will another oom-kiiler running in B-D-E.

OK, does this mean that for_each_mem_cgroup_tree doesn't lock the whole
hierarchy at once? 
I have to confess that the behavior of mem_cgroup_start_loop is a little
bit obscure to me. The comment says it searches for the cgroup with the
minimum ID - I assume this is the root of the hierarchy. Is this
correct?

If yes, then if we have OOM in whatever cgroup in the hierarchy, the
above code should lock the whole hierarchy and the above never
happens. Right?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14  9:51             ` Michal Hocko
@ 2011-07-14 10:17               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 10:17 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu, 14 Jul 2011 11:51:52 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 14-07-11 18:30:14, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 11:00:17 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 14 Jul 2011 10:02:59 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > On Wed, 13 Jul 2011 13:05:49 +0200
> > > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > [...]
> > > > > > This patch replaces the counter by a simple {un}lock semantic. We are
> > > > > > using only 0 and 1 to distinguish those two states.
> > > > > > As mem_cgroup_oom_{un}lock works on the hierarchy we have to make sure
> > > > > > that we cannot race with somebody else which is already guaranteed
> > > > > > because we call both functions with the mutex held. All other consumers
> > > > > > just read the value atomically for a single group which is sufficient
> > > > > > because we set the value atomically.
> > > > > > The other thing is that only that process which locked the oom will
> > > > > > unlock it once the OOM is handled.
> > > > > > 
> > > > > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > > > > > ---
> > > > > >  mm/memcontrol.c |   24 +++++++++++++++++-------
> > > > > >  1 files changed, 17 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index e013b8e..f6c9ead 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -1803,22 +1803,31 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> > > > > >  /*
> > > > > >   * Check OOM-Killer is already running under our hierarchy.
> > > > > >   * If someone is running, return false.
> > > > > > + * Has to be called with memcg_oom_mutex
> > > > > >   */
> > > > > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > > > > >  {
> > > > > > -	int x, lock_count = 0;
> > > > > > +	int x, lock_count = -1;
> > > > > >  	struct mem_cgroup *iter;
> > > > > >  
> > > > > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > > > > -		x = atomic_inc_return(&iter->oom_lock);
> > > > > > -		lock_count = max(x, lock_count);
> > > > > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > > > > +		if (lock_count == -1)
> > > > > > +			lock_count = x;
> > > > > > +
> > > > > 
> > > > > 
> > > > > Hmm...Assume following hierarchy.
> > > > > 
> > > > > 	  A
> > > > >        B     C
> > > > >       D E 
> > > 
> > > IIUC, A, B, D, E are one hierarchy, right?
> > > 
> > yes.
> > 
> > 
> > > > > 
> > > > > The orignal code hanldes the situation
> > > > > 
> > > > >  1. B-D-E is under OOM
> > > > >  2. A enters OOM after 1.
> > > > > 
> > > > > In original code, A will not invoke OOM (because B-D-E oom will kill a process.)
> > > > > The new code invokes A will invoke new OOM....right ?
> > > 
> > > Sorry, I do not understand what you mean by that. 
> > 
> > This is your code.
> > ==
> >  	for_each_mem_cgroup_tree(iter, mem) {
> > -		x = atomic_inc_return(&iter->oom_lock);
> > -		lock_count = max(x, lock_count);
> > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > +		if (lock_count == -1)
> > +			lock_count = x;
> > +
> > +		/* New child can be created but we shouldn't race with
> > +		 * somebody else trying to oom because we are under
> > +		 * memcg_oom_mutex
> > +		 */
> > +		BUG_ON(lock_count != x);
> >  	}
> > ==
> > 
> > When, B,D,E is under OOM,  
> > 
> >    A oom_lock = 0
> >    B oom_lock = 1
> >    C oom_lock = 0
> >    D oom_lock = 1
> >    E oom_lock = 1
> > 
> > Here, assume A enters OOM.
> > 
> >    A oom_lock = 1 -- (*)
> >    B oom_lock = 1
> >    C oom_lock = 1
> >    D oom_lock = 1
> >    E oom_lock = 1
> > 
> > because of (*), mem_cgroup_oom_lock() will return lock_count=1, true.
> > 
> > Then, a new oom-killer will another oom-kiiler running in B-D-E.
> 
> OK, does this mean that for_each_mem_cgroup_tree doesn't lock the whole
> hierarchy at once? 

yes. this for_each_mem_cgroup_tree() just locks a subtree.

> I have to confess that the behavior of mem_cgroup_start_loop is little
> bit obscure to me. The comment says it searches for the cgroup with the
> minimum ID - I assume this is the root of the hierarchy. Is this
> correct?
> 

No. Assume the following sequence.

  1.  cgcreate -g memory:X  css_id=5 assigned.
  ........far later.....
  2.  cgcreate -g memory:A  css_id=30 assigned.
  3.  cgdelete -g memory:X  css_id=5 freed.
  4.  cgcreate -g memory:A/B
  5.  cgcreate -g memory:A/C
  6.  cgcreate -g memory:A/B/D
  7.  cgcreate -g memory:A/B/E

Then, css_id will be
==
 A css_id=30
 B css_id=5  # reuse X's id.
 C css_id=31
 D css_id=32
 E css_id=33
==
Then, the search under "B" will find B->D->E.

The search under "A" will find B->A->C->D->E.
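
Just to make that ordering concrete, here is a throwaway userspace sketch
(the names and ids are only the ones from the example above; it mimics the
css_id ordering, not the real iterator):
==
#include <stdio.h>

struct group { const char *name; int css_id; int parent; /* index, -1 = none */ };

static struct group g[] = {
	{ "A", 30, -1 }, { "B", 5, 0 }, { "C", 31, 0 },
	{ "D", 32, 1 },  { "E", 33, 1 },
};
#define NGROUPS (int)(sizeof(g) / sizeof(g[0]))

/* is g[i] inside the subtree rooted at g[root]? */
static int under(int i, int root)
{
	for (; i != -1; i = g[i].parent)
		if (i == root)
			return 1;
	return 0;
}

/* visit the subtree members in ascending css_id order */
static void walk(int root)
{
	int id, i;

	printf("search under %s:", g[root].name);
	for (id = 0; id <= 33; id++)
		for (i = 0; i < NGROUPS; i++)
			if (g[i].css_id == id && under(i, root))
				printf(" %s", g[i].name);
	printf("\n");
}

int main(void)
{
	walk(1);	/* search under B: B D E */
	walk(0);	/* search under A: B A C D E */
	return 0;
}
==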

> If yes then if we have oom in what-ever cgroup in the hierarchy then
> the above code should lock the whole hierarchy and the above never
> happens. Right?

Yes and no. The old code allows the following to happen at the same time.

      A
    B   C
   D E   F
 
   B-D-E goes into OOM because of B's limit.
   C-F   goes into OOM because of C's limit


When you stop OOM under A because of B's limit, C can't invoke OOM.

After a little more consideration, my suggestion is,

=== lock ===
	bool success = true;
	...
	for_each_mem_cgroup_tree(iter, mem) {
		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
		/* "break" loop is not allowed because of css refcount....*/
	}
	return success.

By this, when a sub-hierarchy is under OOM, we don't invoke a new OOM.


=== unlock ===
	struct mem_cgroup *oom_root;

	oom_root = memcg; 
	do {
		struct mem_cgroup *parent;

		parent = mem_cgroup_parent(oom_root);
		if (!parent || !parent->use_hierarchy)
			break;

		if (atomic_read(&parent->oom_lock))
			break;
		/* walk up while the parent is not oom-locked */
		oom_root = parent;
	} while (1);

	for_each_mem_cgroup_tree(iter, oom_root)
		atomic_add_unless(&iter->oom_lock, -1, 0);

By this, at unlock time, we unlock the oom-lock of the whole hierarchy which
was kept under oom_lock only because one of its sub-hierarchies was under OOM.

==


Maybe this is not as complicated as my 1st patch.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14 10:17               ` KAMEZAWA Hiroyuki
@ 2011-07-14 11:09                 ` Michal Hocko
  -1 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14 11:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 19:17:28, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 11:51:52 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Thu 14-07-11 18:30:14, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 14 Jul 2011 11:00:17 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > > 
> > > > On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> > > > > On Thu, 14 Jul 2011 10:02:59 +0900
> > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
[...]
> > > ==
> > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > -		x = atomic_inc_return(&iter->oom_lock);
> > > -		lock_count = max(x, lock_count);
> > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > +		if (lock_count == -1)
> > > +			lock_count = x;
> > > +
> > > +		/* New child can be created but we shouldn't race with
> > > +		 * somebody else trying to oom because we are under
> > > +		 * memcg_oom_mutex
> > > +		 */
> > > +		BUG_ON(lock_count != x);
> > >  	}
> > > ==
> > > 
> > > When, B,D,E is under OOM,  
> > > 
> > >    A oom_lock = 0
> > >    B oom_lock = 1
> > >    C oom_lock = 0
> > >    D oom_lock = 1
> > >    E oom_lock = 1
> > > 
> > > Here, assume A enters OOM.
> > > 
> > >    A oom_lock = 1 -- (*)
> > >    B oom_lock = 1
> > >    C oom_lock = 1
> > >    D oom_lock = 1
> > >    E oom_lock = 1
> > > 
> > > because of (*), mem_cgroup_oom_lock() will return lock_count=1, true.
> > > 
> > > Then, a new oom-killer will another oom-kiiler running in B-D-E.
> > 
> > OK, does this mean that for_each_mem_cgroup_tree doesn't lock the whole
> > hierarchy at once? 
> 
> yes. this for_each_mem_cgroup_tree() just locks a subtree.

OK, then I really misunderstood the macro and now I see your point.
Thinking about it some more, having the full hierarchy locked is not that
good an idea after all. We would also block parallel branches, which would
not bail out from OOM while we handle the OOM condition in another branch.

> 
> > I have to confess that the behavior of mem_cgroup_start_loop is little
> > bit obscure to me. The comment says it searches for the cgroup with the
> > minimum ID - I assume this is the root of the hierarchy. Is this
> > correct?
> > 
> 
> No. Assume following sequence.
> 
>   1.  cgcreate -g memory:X  css_id=5 assigned.
>   ........far later.....
>   2.  cgcreate -g memory:A  css_id=30 assigned.
>   3.  cgdelete -g memory:X  css_id=5 freed.
>   4.  cgcreate -g memory:A/B
>   5.  cgcreate -g memory:A/C
>   6.  cgcreate -g memory:A/B/D
>   7.  cgcreate -g memory:A/B/E
> 
> Then, css_id will be
> ==
>  A css_id=30
>  B css_id=5  # reuse X's id.
>  C css_id=31
>  D css_id=32
>  E css_id=33
> ==
> Then, the search under "B" will find B->D->E
> 
> The search under "A" will find B->A->C->D->E. 
> 
> > If yes then if we have oom in what-ever cgroup in the hierarchy then
> > the above code should lock the whole hierarchy and the above never
> > happens. Right?
> 
> Yes and no. old code allows following happens at the same time.
> 
>       A
>     B   C
>    D E   F
>  
>    B-D-E goes into OOM because of B's limit.
>    C-F   goes into OOM because of C's limit
> 
> 
> When you stop OOM under A because of B's limit, C can't invoke OOM.
> 
> After a little more consideration, my suggestion is,
> 
> === lock ===
> 	bool success = true;
> 	...
> 	for_each_mem_cgroup_tree(iter, mem) {
> 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> 		/* "break" loop is not allowed because of css refcount....*/
> 	}
> 	return success.
> 
> By this, when a sub-hierarchy is under OOM, don't invoke new OOM.

Hmm, I am afraid this will not work either. The group tree traversal
depends on the creation order, so we might end up seeing a locked subtree
sooner than an unlocked one, so we could grant the lock and see multiple
OOMs. We have to guarantee that we do not grant the lock if we encounter
an already locked sub-group (and then we have to clear oom_lock for all
groups that we have already visited).
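
One way to do that (just a sketch, not a tested patch, and the trylock name
is made up) would be to keep the existing for_each_mem_cgroup_tree walk (so
no break out of the loop) and roll back on failure:
==
/* has to be called with memcg_oom_mutex held */
static bool mem_cgroup_oom_trylock(struct mem_cgroup *mem)
{
	struct mem_cgroup *iter, *failed = NULL;
	bool done = false;

	/*
	 * First pass: try to take the lock everywhere in the subtree and
	 * remember the first group which somebody else has locked already.
	 */
	for_each_mem_cgroup_tree(iter, mem) {
		if (failed)
			continue;
		if (!atomic_add_unless(&iter->oom_lock, 1, 1))
			failed = iter;
	}

	if (!failed)
		return true;

	/*
	 * Second pass: the walk is deterministic, so the groups we locked
	 * are exactly those visited before @failed - undo them.
	 */
	for_each_mem_cgroup_tree(iter, mem) {
		if (iter == failed)
			done = true;
		if (!done)
			atomic_add_unless(&iter->oom_lock, -1, 0);
	}
	return false;
}
==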

> === unlock ===
> 	struct mem_cgroup *oom_root;
> 
> 	oom_root = memcg; 
> 	do {
> 		struct mem_cgroup *parent;
> 
> 		parent = mem_cgroup_parent(oom_root);
> 		if (!parent || !parent->use_hierarchy)
> 			break;
> 
> 		if (atomic_read(&parent->oom_lock))
> 			break;
> 	} while (1);
> 
> 	for_each_mem_cgroup_tree(iter, oom_root)
> 		atomic_add_unless(&iter->oom_lock, -1, 0);
> 
> By this, at unlock, unlock oom-lock of a hierarchy which was under oom_lock
> because of a sub-hierarchy was under OOM.

This would also unlock groups that might hold a parallel oom lock:
A - B - C - D oom (from B)
  - E - F  oom (F)

An unlock in whichever branch will also unlock the parallel oom.
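
If the lock side rolls back like that, then only one task can own any given
group, so the unlock can stay confined to the subtree the winner actually
locked (again just a sketch) and the parallel F lock stays untouched:
==
	/* only the task whose trylock succeeded gets here, under memcg_oom_mutex */
	for_each_mem_cgroup_tree(iter, mem)
		atomic_add_unless(&iter->oom_lock, -1, 0);
==
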
I will think about something else and return to your first patch if I
find it over complicated as well.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14 11:09                 ` Michal Hocko
@ 2011-07-14 11:30                   ` Michal Hocko
  -1 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14 11:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 13:09:35, Michal Hocko wrote:
> On Thu 14-07-11 19:17:28, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 11:51:52 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Thu 14-07-11 18:30:14, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 14 Jul 2011 11:00:17 +0200
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > 
> > > > > On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> > > > > > On Thu, 14 Jul 2011 10:02:59 +0900
> > > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> [...]
> > > > ==
> > > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > > -		x = atomic_inc_return(&iter->oom_lock);
> > > > -		lock_count = max(x, lock_count);
> > > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > > +		if (lock_count == -1)
> > > > +			lock_count = x;
> > > > +
> > > > +		/* New child can be created but we shouldn't race with
> > > > +		 * somebody else trying to oom because we are under
> > > > +		 * memcg_oom_mutex
> > > > +		 */
> > > > +		BUG_ON(lock_count != x);
> > > >  	}
> > > > ==
> > > > 
> > > > When, B,D,E is under OOM,  
> > > > 
> > > >    A oom_lock = 0
> > > >    B oom_lock = 1
> > > >    C oom_lock = 0
> > > >    D oom_lock = 1
> > > >    E oom_lock = 1
> > > > 
> > > > Here, assume A enters OOM.
> > > > 
> > > >    A oom_lock = 1 -- (*)
> > > >    B oom_lock = 1
> > > >    C oom_lock = 1
> > > >    D oom_lock = 1
> > > >    E oom_lock = 1
> > > > 
> > > > because of (*), mem_cgroup_oom_lock() will return lock_count=1, true.
> > > > 
> > > > Then, a new oom-killer will run in parallel with another oom-killer already running in B-D-E.
> > > 
> > > OK, does this mean that for_each_mem_cgroup_tree doesn't lock the whole
> > > hierarchy at once? 
> > 
> > yes. this for_each_mem_cgroup_tree() just locks a subtree.
> 
> OK, then I really misunderstood the macro and now I see your points.
> Thinking about it some more having a full hierarchy locked is not that
> good idea after all. We would block also parallel branches which will
> not bail out from OOM if we handle oom condition in another branch.
> 
> > 
> > > I have to confess that the behavior of mem_cgroup_start_loop is little
> > > bit obscure to me. The comment says it searches for the cgroup with the
> > > minimum ID - I assume this is the root of the hierarchy. Is this
> > > correct?
> > > 
> > 
> > No. Assume following sequence.
> > 
> >   1.  cgcreate -g memory:X  css_id=5 assigned.
> >   ........far later.....
> >   2.  cgcreate -g memory:A  css_id=30 assigned.
> >   3.  cgdelete -g memory:X  css_id=5 freed.
> >   4.  cgcreate -g memory:A/B
> >   5.  cgcreate -g memory:A/C
> >   6.  cgcreate -g memory:A/B/D
> >   7.  cgcreate -g memory:A/B/E
> > 
> > Then, css_id will be
> > ==
> >  A css_id=30
> >  B css_id=5  # reuse X's id.
> >  C css_id=31
> >  D css_id=32
> >  E css_id=33
> > ==
> > Then, the search under "B" will find B->D->E
> > 
> > The search under "A" will find B->A->C->D->E. 
> > 
> > > If yes then if we have oom in what-ever cgroup in the hierarchy then
> > > the above code should lock the whole hierarchy and the above never
> > > happens. Right?
> > 
> > Yes and no. old code allows following happens at the same time.
> > 
> >       A
> >     B   C
> >    D E   F
> >  
> >    B-D-E goes into OOM because of B's limit.
> >    C-F   goes into OOM because of C's limit
> > 
> > 
> > When you stop OOM under A because of B's limit, C can't invoke OOM.
> > 
> > After a little more consideration, my suggestion is,
> > 
> > === lock ===
> > 	bool success = true;
> > 	...
> > 	for_each_mem_cgroup_tree(iter, mem) {
> > 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > 		/* "break" loop is not allowed because of css refcount....*/
> > 	}
> > 	return success.
> > 
> > By this, when a sub-hierarchy is under OOM, don't invoke new OOM.
> 
> Hmm, I am afraid this will not work as well. The group tree traversing
> depends on the creation order so we might end up seeing locked subtree
> sooner than unlocked so we could grant the lock and see multiple OOMs.
> We have to guarantee that we do not grant the lock if we encounter
> already locked sub group (and then we have to clear oom_lock for all
> groups that we have already visited).
> 
> > === unlock ===
> > 	struct mem_cgroup *oom_root;
> > 
> > 	oom_root = memcg; 
> > 	do {
> > 		struct mem_cgroup *parent;
> > 
> > 		parent = mem_cgroup_parent(oom_root);
> > 		if (!parent || !parent->use_hierarchy)
> > 			break;
> > 
> > 		if (atomic_read(&parent->oom_lock))
> > 			break;
> > 	} while (1);
> > 
> > 	for_each_mem_cgroup_tree(iter, oom_root)
> > 		atomic_add_unless(&iter->oom_lock, -1, 0);
> > 
> > By this, at unlock, unlock oom-lock of a hierarchy which was under oom_lock
> > because of a sub-hierarchy was under OOM.
> 
> This would unlock also groups that might have a parallel oom lock.
> A - B - C - D oom (from B)
>   - E - F  oom (F)
> 
> unlock in what-ever branch will unlock also the parallel oom.
> I will think about something else and return to your first patch if I
> find it over complicated as well.

What about this? Just compile tested:
--- 
From 90ab974eb69c61c2e3b94beabe9b6745fa319936 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 13 Jul 2011 13:05:49 +0200
Subject: [PATCH] memcg: make oom_lock 0 and 1 based rather than coutner

867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
which is incremented by mem_cgroup_oom_lock when we are about to handle
memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
oom_lock > 1 to prevent from multiple oom kills at the same time.
The counter is then decremented by mem_cgroup_oom_unlock called from the
same function.

This works correctly but it can lead to serious starvations when we
have many processes triggering OOM.

Consider a process (call it A) which gets the oom_lock (the first one
that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
other processes are blocked on the mutex.
While A releases the mutex and calls mem_cgroup_out_of_memory others
will wake up (one after another) and increase the counter and fall into
sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
takes the mutex again and decreases oom_lock and wakes other tasks (if
releasing memory of the killed task hasn't done it yet).
The main problem here is that everybody still race for the mutex and
there is no guarantee that we will get counter back to 0 for those
that got back to mem_cgroup_handle_oom. In the end the whole convoy
in/decreases the counter but we do not get to 1 that would enable
killing so nothing useful is going on.
The time is basically unbounded because it highly depends on scheduling
and ordering on mutex.

This patch replaces the counter by a simple {un}lock semantic. We are
using only 0 and 1 to distinguish those two states.
As mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have
to make sure that nobody else races with us which is guaranteed by the
memcg_oom_mutex. All other consumers just read the value atomically for
a single group which is sufficient because we set the value atomically.
mem_cgroup_oom_lock has to be really careful because we might be
higher in a hierarchy than an already oom locked subtree of the same
hierarchy:
          A
        /   \
       B     \
      /\      \
     C  D     E

B - C - D tree might be already locked. While we want to enable locking E
subtree because OOM situations cannot influence each other we definitely
do not want to allow locking A.
Therefore we have to refuse the lock if any subtree is already locked and
clear the lock for all nodes that have been set up to the failure
point.
Unlock path is then very easy because we always unlock only that subtree
we have locked previously.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   48 +++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..29f00d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1803,22 +1803,51 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
+ * Has to be called with memcg_oom_mutex
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
-	int x, lock_count = 0;
-	struct mem_cgroup *iter;
+	int x, lock_count = -1;
+	struct mem_cgroup *iter, *failed = NULL;
+	bool cond = true;
 
-	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
+		if (lock_count == -1)
+			lock_count = x;
+		else if (lock_count != x) {
+			/*
+			 * this subtree of our hierarchy is already locked
+			 * so we cannot give a lock.
+			 */
+			lock_count = 0;
+			failed = iter;
+			cond = false;
+		}
 	}
 
-	if (lock_count == 1)
-		return true;
-	return false;
+	if (!failed)
+		goto done;
+
+	/*
+	 * OK, we failed to lock the whole subtree so we have to clean up
+	 * what we set up to the failing subtree
+	 */
+	cond = true;
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		if (iter == failed) {
+			cond = false;
+			continue;
+		}
+		atomic_set(&iter->oom_lock, 0);
+	}
+done:
+	return lock_count;
 }
 
+/*
+ * Has to be called with memcg_oom_mutex
+ */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
 	struct mem_cgroup *iter;
@@ -1916,7 +1945,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
-- 
1.7.5.4


-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
@ 2011-07-14 11:30                   ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14 11:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 13:09:35, Michal Hocko wrote:
> On Thu 14-07-11 19:17:28, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 11:51:52 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Thu 14-07-11 18:30:14, KAMEZAWA Hiroyuki wrote:
> > > > On Thu, 14 Jul 2011 11:00:17 +0200
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > 
> > > > > On Thu 14-07-11 11:59:13, KAMEZAWA Hiroyuki wrote:
> > > > > > On Thu, 14 Jul 2011 10:02:59 +0900
> > > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> [...]
> > > > ==
> > > >  	for_each_mem_cgroup_tree(iter, mem) {
> > > > -		x = atomic_inc_return(&iter->oom_lock);
> > > > -		lock_count = max(x, lock_count);
> > > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > > +		if (lock_count == -1)
> > > > +			lock_count = x;
> > > > +
> > > > +		/* New child can be created but we shouldn't race with
> > > > +		 * somebody else trying to oom because we are under
> > > > +		 * memcg_oom_mutex
> > > > +		 */
> > > > +		BUG_ON(lock_count != x);
> > > >  	}
> > > > ==
> > > > 
> > > > When, B,D,E is under OOM,  
> > > > 
> > > >    A oom_lock = 0
> > > >    B oom_lock = 1
> > > >    C oom_lock = 0
> > > >    D oom_lock = 1
> > > >    E oom_lock = 1
> > > > 
> > > > Here, assume A enters OOM.
> > > > 
> > > >    A oom_lock = 1 -- (*)
> > > >    B oom_lock = 1
> > > >    C oom_lock = 1
> > > >    D oom_lock = 1
> > > >    E oom_lock = 1
> > > > 
> > > > because of (*), mem_cgroup_oom_lock() will return lock_count=1, true.
> > > > 
> > > > Then, a new oom-killer will run in parallel with another oom-killer already running in B-D-E.
> > > 
> > > OK, does this mean that for_each_mem_cgroup_tree doesn't lock the whole
> > > hierarchy at once? 
> > 
> > yes. this for_each_mem_cgroup_tree() just locks a subtree.
> 
> OK, then I really misunderstood the macro and now I see your points.
> Thinking about it some more having a full hierarchy locked is not that
> good idea after all. We would block also parallel branches which will
> not bail out from OOM if we handle oom condition in another branch.
> 
> > 
> > > I have to confess that the behavior of mem_cgroup_start_loop is little
> > > bit obscure to me. The comment says it searches for the cgroup with the
> > > minimum ID - I assume this is the root of the hierarchy. Is this
> > > correct?
> > > 
> > 
> > No. Assume following sequence.
> > 
> >   1.  cgcreate -g memory:X  css_id=5 assigned.
> >   ........far later.....
> >   2.  cgcreate -g memory:A  css_id=30 assigned.
> >   3.  cgdelete -g memory:X  css_id=5 freed.
> >   4.  cgcreate -g memory:A/B
> >   5.  cgcreate -g memory:A/C
> >   6.  cgcreate -g memory:A/B/D
> >   7.  cgcreate -g memory:A/B/E
> > 
> > Then, css_id will be
> > ==
> >  A css_id=30
> >  B css_id=5  # reuse X's id.
> >  C css_id=31
> >  D css_id=32
> >  E css_id=33
> > ==
> > Then, the search under "B" will find B->D->E
> > 
> > The search under "A" will find B->A->C->D->E. 
> > 
> > > If yes then if we have oom in what-ever cgroup in the hierarchy then
> > > the above code should lock the whole hierarchy and the above never
> > > happens. Right?
> > 
> > Yes and no. old code allows following happens at the same time.
> > 
> >       A
> >     B   C
> >    D E   F
> >  
> >    B-D-E goes into OOM because of B's limit.
> >    C-F   goes into OOM because of C's limit
> > 
> > 
> > When you stop OOM under A because of B's limit, C can't invoke OOM.
> > 
> > After a little more consideration, my suggestion is,
> > 
> > === lock ===
> > 	bool success = true;
> > 	...
> > 	for_each_mem_cgroup_tree(iter, mem) {
> > 		success &= !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > 		/* "break" loop is not allowed because of css refcount....*/
> > 	}
> > 	return success.
> > 
> > By this, when a sub-hierarchy is under OOM, don't invoke new OOM.
> 
> Hmm, I am afraid this will not work as well. The group tree traversing
> depends on the creation order so we might end up seeing locked subtree
> sooner than unlocked so we could grant the lock and see multiple OOMs.
> We have to guarantee that we do not grant the lock if we encounter
> already locked sub group (and then we have to clear oom_lock for all
> groups that we have already visited).
> 
> > === unlock ===
> > 	struct mem_cgroup *oom_root;
> > 
> > 	oom_root = memcg; 
> > 	do {
> > 		struct mem_cgroup *parent;
> > 
> > 		parent = mem_cgroup_parent(oom_root);
> > 		if (!parent || !parent->use_hierarchy)
> > 			break;
> > 
> > 		if (atomic_read(&parent->oom_lock))
> > 			break;
> > 	} while (1);
> > 
> > 	for_each_mem_cgroup_tree(iter, oom_root)
> > 		atomic_add_unless(&iter->oom_lock, -1, 0);
> > 
> > By this, at unlock, unlock oom-lock of a hierarchy which was under oom_lock
> > because of a sub-hierarchy was under OOM.
> 
> This would unlock also groups that might have a parallel oom lock.
> A - B - C - D oom (from B)
>   - E - F  oom (F)
> 
> unlock in what-ever branch will unlock also the parallel oom.
> I will think about something else and return to your first patch if I
> find it over complicated as well.

What about this? Just compile tested:
--- 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14 11:30                   ` Michal Hocko
@ 2011-07-14 11:50                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 11:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu, 14 Jul 2011 13:30:09 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> What about this? Just compile tested:
> --- 
> From 90ab974eb69c61c2e3b94beabe9b6745fa319936 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 13 Jul 2011 13:05:49 +0200
> Subject: [PATCH] memcg: make oom_lock 0 and 1 based rather than coutner
> 
> 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> which is incremented by mem_cgroup_oom_lock when we are about to handle
> memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> oom_lock > 1 to prevent from multiple oom kills at the same time.
> The counter is then decremented by mem_cgroup_oom_unlock called from the
> same function.
> 
> This works correctly but it can lead to serious starvations when we
> have many processes triggering OOM.
> 
> Consider a process (call it A) which gets the oom_lock (the first one
> that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> other processes are blocked on the mutex.
> While A releases the mutex and calls mem_cgroup_out_of_memory others
> will wake up (one after another) and increase the counter and fall into
> sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
> takes the mutex again and decreases oom_lock and wakes other tasks (if
> releasing memory of the killed task hasn't done it yet).
> The main problem here is that everybody still race for the mutex and
> there is no guarantee that we will get counter back to 0 for those
> that got back to mem_cgroup_handle_oom. In the end the whole convoy
> in/decreases the counter but we do not get to 1 that would enable
> killing so nothing useful is going on.
> The time is basically unbounded because it highly depends on scheduling
> and ordering on mutex.
> 
> This patch replaces the counter by a simple {un}lock semantic. We are
> using only 0 and 1 to distinguish those two states.
> As mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have
> to make sure that nobody else races with us which is guaranteed by the
> memcg_oom_mutex. All other consumers just read the value atomically for
> a single group which is sufficient because we set the value atomically.
> mem_cgroup_oom_lock has to be really careful because we might be in
> higher in a hierarchy than already oom locked subtree of the same
> hierarchy:
>           A
>         /   \
>        B     \
>       /\      \
>      C  D     E
> 
> B - C - D tree might be already locked. While we want to enable locking E
> subtree because OOM situations cannot influence each other we definitely
> do not want to allow locking A.
> Therefore we have to refuse lock if any subtree is already locked and
> clear up the lock for all nodes that have been set up to the failure
> point.
> Unlock path is then very easy because we always unlock only that subtree
> we have locked previously.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   48 +++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 39 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..29f00d0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1803,22 +1803,51 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
> + * Has to be called with memcg_oom_mutex
>   */
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  {
> -	int x, lock_count = 0;
> -	struct mem_cgroup *iter;
> +	int x, lock_count = -1;
> +	struct mem_cgroup *iter, *failed = NULL;
> +	bool cond = true;
>  
> -	for_each_mem_cgroup_tree(iter, mem) {
> -		x = atomic_inc_return(&iter->oom_lock);
> -		lock_count = max(x, lock_count);
> +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> +		if (lock_count == -1)
> +			lock_count = x;
> +		else if (lock_count != x) {
> +			/*
> +			 * this subtree of our hierarchy is already locked
> +			 * so we cannot give a lock.
> +			 */
> +			lock_count = 0;
> +			failed = iter;
> +			cond = false;
> +		}
>  	}

Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
Before lock
  A  0
  B  1
  C  1
  D  1
  E  0

After lock
  A  1
  B  1
  C  1
  D  1
  E  0

here, failed = B, cond = false. Undo routine will unlock A.
Hmm, seems to work in this case.
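
For completeness, the walk above can be replayed with a small userspace toy.
This is not kernel code: the fixed array stands in for
for_each_mem_cgroup_tree_cond() and a plain compare for atomic_add_unless(),
which is only valid because memcg_oom_mutex serializes the real thing:
==
#include <stdio.h>

int main(void)
{
	const char *name[] = { "A", "B", "C", "D", "E" };
	int oom_lock[]     = { 0,   1,   1,   1,   0  };	/* B-C-D already held */
	int lock_count = -1, failed = -1, i;

	for (i = 0; i < 5 && failed < 0; i++) {
		int x = (oom_lock[i] == 0);	/* would atomic_add_unless() succeed? */
		if (x)
			oom_lock[i] = 1;
		if (lock_count == -1)
			lock_count = x;
		else if (lock_count != x) {
			lock_count = 0;
			failed = i;		/* remember where we hit the conflict */
		}
	}

	if (failed >= 0)	/* undo only what we locked before the conflict */
		for (i = 0; i < failed; i++)
			oom_lock[i] = 0;

	for (i = 0; i < 5; i++)
		printf("%s oom_lock = %d\n", name[i], oom_lock[i]);
	printf("lock granted: %s\n", lock_count ? "yes" : "no");
	return 0;
}
==
It prints the state after the undo pass: A is back to 0 and the lock is not
granted, which matches the tables above.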

But....A's oom_lock==0, so memcg_oom_wakeup() et al. will not be able to
know "A" is in OOM and wake up processes in A which are waiting for oom recovery.

Will this work ?
==
 # cgcreate -g memory:A
 # cgset -r memory.use_hierarchy=1 A
 # cgset -r memory.oom_control=1   A
 # cgset -r memory.limit_in_bytes= 100M
 # cgset -r memory.memsw.limit_in_bytes= 100M
 # cgcreate -g memory:A/B
 # cgset -r memory.oom_control=1 A/B
 # cgset -r memory.limit_in_bytes=20M
 # cgset -r memory.memsw.limit_in_bytes=20M

 Assume malloc XXX is a program allocating XXX Megabytes of memory.

 # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
 # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A


Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.

 # cgset -r memory.memsw.limit_in_bytes=300M A
 # cgset -r memory.limit_in_bytes=300M A

 malloc 80 will end.
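
For reference, a minimal stand-in for the "malloc XXX" program could look like
this (hypothetical: it takes the size in MB from argv[1], touches the pages so
they are really charged to the cgroup, and then just stays alive):
==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 1;
	char *buf = malloc(mb << 20);

	if (!buf) {
		perror("malloc");
		return 1;
	}
	memset(buf, 0, mb << 20);	/* fault the pages in */
	printf("allocated %zu MB\n", mb);
	pause();			/* keep the memory charged */
	return 0;
}
==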

Thanks,
-Kame




>  
> -	if (lock_count == 1)
> -		return true;
> -	return false;
> +	if (!failed)
> +		goto done;
> +
> +	/*
> +	 * OK, we failed to lock the whole subtree so we have to clean up
> +	 * what we set up to the failing subtree
> +	 */
> +	cond = true;
> +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> +		if (iter == failed) {
> +			cond = false;
> +			continue;
> +		}
> +		atomic_set(&iter->oom_lock, 0)
> +	}









> +done:
> +	return lock_count;
>  }
>  
> +/*
> + * Has to be called with memcg_oom_mutex
> + */
>  static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
>  {
>  	struct mem_cgroup *iter;
> @@ -1916,7 +1945,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  	}
>  	mutex_lock(&memcg_oom_mutex);
> -	mem_cgroup_oom_unlock(mem);
> +	if (locked)
> +		mem_cgroup_oom_unlock(mem);
>  	memcg_wakeup_oom(mem);
>  	mutex_unlock(&memcg_oom_mutex);
>  
> -- 
> 1.7.5.4
> 
> 
> -- 
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9    
> Czech Republic
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
@ 2011-07-14 11:50                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 11:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu, 14 Jul 2011 13:30:09 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> What about this? Just compile tested:
> --- 
> From 90ab974eb69c61c2e3b94beabe9b6745fa319936 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 13 Jul 2011 13:05:49 +0200
> Subject: [PATCH] memcg: make oom_lock 0 and 1 based rather than coutner
> 
> 867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
> which is incremented by mem_cgroup_oom_lock when we are about to handle
> memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
> oom_lock > 1 to prevent from multiple oom kills at the same time.
> The counter is then decremented by mem_cgroup_oom_unlock called from the
> same function.
> 
> This works correctly but it can lead to serious starvations when we
> have many processes triggering OOM.
> 
> Consider a process (call it A) which gets the oom_lock (the first one
> that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
> other processes are blocked on the mutex.
> While A releases the mutex and calls mem_cgroup_out_of_memory others
> will wake up (one after another) and increase the counter and fall into
> sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
> takes the mutex again and decreases oom_lock and wakes other tasks (if
> releasing memory of the killed task hasn't done it yet).
> The main problem here is that everybody still race for the mutex and
> there is no guarantee that we will get counter back to 0 for those
> that got back to mem_cgroup_handle_oom. In the end the whole convoy
> in/decreases the counter but we do not get to 1 that would enable
> killing so nothing useful is going on.
> The time is basically unbounded because it highly depends on scheduling
> and ordering on mutex.
> 
> This patch replaces the counter by a simple {un}lock semantic. We are
> using only 0 and 1 to distinguish those two states.
> As mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have
> to make sure that nobody else races with us which is guaranteed by the
> memcg_oom_mutex. All other consumers just read the value atomically for
> a single group which is sufficient because we set the value atomically.
> mem_cgroup_oom_lock has to be really careful because we might be in
> higher in a hierarchy than already oom locked subtree of the same
> hierarchy:
>           A
>         /   \
>        B     \
>       /\      \
>      C  D     E
> 
> B - C - D tree might be already locked. While we want to enable locking E
> subtree because OOM situations cannot influence each other we definitely
> do not want to allow locking A.
> Therefore we have to refuse lock if any subtree is already locked and
> clear up the lock for all nodes that have been set up to the failure
> point.
> Unlock path is then very easy because we always unlock only that subtree
> we have locked previously.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   48 +++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 39 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..29f00d0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1803,22 +1803,51 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
> + * Has to be called with memcg_oom_mutex
>   */
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  {
> -	int x, lock_count = 0;
> -	struct mem_cgroup *iter;
> +	int x, lock_count = -1;
> +	struct mem_cgroup *iter, *failed = NULL;
> +	bool cond = true;
>  
> -	for_each_mem_cgroup_tree(iter, mem) {
> -		x = atomic_inc_return(&iter->oom_lock);
> -		lock_count = max(x, lock_count);
> +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> +		if (lock_count == -1)
> +			lock_count = x;
> +		else if (lock_count != x) {
> +			/*
> +			 * this subtree of our hierarchy is already locked
> +			 * so we cannot give a lock.
> +			 */
> +			lock_count = 0;
> +			failed = iter;
> +			cond = false;
> +		}
>  	}

Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
Before lock
  A  0
  B  1
  C  1
  D  1
  E  0

After lock
  A  1
  B  1
  C  1
  D  1
  E  0

here, failed = B, cond = false. Undo routine will unlock A.
Hmm, seems to work in this case.

But....A's oom_lock==0, so memcg_oom_wakeup() et al. will not be able to
know "A" is in OOM and wake up processes in A which are waiting for oom recovery.

Will this work ?
==
 # cgcreate -g memory:A
 # cgset -r memory.use_hierarchy=1 A
 # cgset -r memory.oom_control=1   A
 # cgset -r memory.limit_in_bytes= 100M
 # cgset -r memory.memsw.limit_in_bytes= 100M
 # cgcreate -g memory:A/B
 # cgset -r memory.oom_control=1 A/B
 # cgset -r memory.limit_in_bytes=20M
 # cgset -r memory.memsw.limit_in_bytes=20M

 Assume malloc XXX is a program allocating XXX Megabytes of memory.

 # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
 # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A


Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.

 # cgset -r memory.memsw.limit_in_bytes=300M A
 # cgset -r memory.limit_in_bytes=300M A

 malloc 80 will end.

Thanks,
-Kame




>  
> -	if (lock_count == 1)
> -		return true;
> -	return false;
> +	if (!failed)
> +		goto done;
> +
> +	/*
> +	 * OK, we failed to lock the whole subtree so we have to clean up
> +	 * what we set up to the failing subtree
> +	 */
> +	cond = true;
> +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> +		if (iter == failed) {
> +			cond = false;
> +			continue;
> +		}
> +		atomic_set(&iter->oom_lock, 0)
> +	}









> +done:
> +	return lock_count;
>  }
>  
> +/*
> + * Has to be called with memcg_oom_mutex
> + */
>  static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
>  {
>  	struct mem_cgroup *iter;
> @@ -1916,7 +1945,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  	}
>  	mutex_lock(&memcg_oom_mutex);
> -	mem_cgroup_oom_unlock(mem);
> +	if (locked)
> +		mem_cgroup_oom_unlock(mem);
>  	memcg_wakeup_oom(mem);
>  	mutex_unlock(&memcg_oom_mutex);
>  
> -- 
> 1.7.5.4
> 
> 
> -- 
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9    
> Czech Republic
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14 11:50                     ` KAMEZAWA Hiroyuki
@ 2011-07-14 12:55                       ` Michal Hocko
  -1 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14 12:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 20:50:12, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 13:30:09 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
[...]
> >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> >  {
> > -	int x, lock_count = 0;
> > -	struct mem_cgroup *iter;
> > +	int x, lock_count = -1;
> > +	struct mem_cgroup *iter, *failed = NULL;
> > +	bool cond = true;
> >  
> > -	for_each_mem_cgroup_tree(iter, mem) {
> > -		x = atomic_inc_return(&iter->oom_lock);
> > -		lock_count = max(x, lock_count);
> > +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > +		if (lock_count == -1)
> > +			lock_count = x;
> > +		else if (lock_count != x) {
> > +			/*
> > +			 * this subtree of our hierarchy is already locked
> > +			 * so we cannot give a lock.
> > +			 */
> > +			lock_count = 0;
> > +			failed = iter;
> > +			cond = false;
> > +		}
> >  	}
> 
> Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
> And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
> Before lock
>   A  0
>   B  1
>   C  1
>   D  1
>   E  0
> 
> After lock
>   A  1
>   B  1
>   C  1
>   D  1
>   E  0
> 
> here, failed = B, cond = false. Undo routine will unlock A.
> Hmm, seems to work in this case.
> 
> But....A's oom_lock==0 and memcg_oom_wakeup() at el will not able to
> know "A" is in OOM. wakeup processes in A which is waiting for oom recover..

Hohm, we need to have 2 different states: lock and mark_oom.
memcg_oom_recover would check only under_oom.

> 
> Will this work ?

No it won't because the rest of the world has no idea that A is
under_oom as well.

> ==
>  # cgcreate -g memory:A
>  # cgset -r memory.use_hierarchy=1 A
>  # cgset -r memory.oom_control=1   A
>  # cgset -r memory.limit_in_bytes= 100M
>  # cgset -r memory.memsw.limit_in_bytes= 100M
>  # cgcreate -g memory:A/B
>  # cgset -r memory.oom_control=1 A/B
>  # cgset -r memory.limit_in_bytes=20M
>  # cgset -r memory.memsw.limit_in_bytes=20M
> 
>  Assume malloc XXX is a program allocating XXX Megabytes of memory.
> 
>  # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
>  # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
> 
> 
> Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.
> 
>  # cgset -r memory.memsw.limit_in_bytes=300M A
>  # cgset -r memory.limit_in_bytes=300M A
> 
>  malloc 80 will end.

What about yet another approach? Very similar to what you proposed, I
guess. Again not tested and it needs some cleanup, just to illustrate.
What do you think?
--- 
From 964158e226555a7a6a4d946062461d2b97c1c539 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 13 Jul 2011 13:05:49 +0200
Subject: [PATCH] memcg: make oom_lock 0 and 1 based rather than coutner

867578cb "memcg: fix oom kill behavior" introduced oom_lock counter
which is incremented by mem_cgroup_oom_lock when we are about to handle
memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if
oom_lock > 1 to prevent from multiple oom kills at the same time.
The counter is then decremented by mem_cgroup_oom_unlock called from the
same function.

This works correctly but it can lead to serious starvations when we
have many processes triggering OOM.

Consider a process (call it A) which gets the oom_lock (the first one
that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex). All
other processes are blocked on the mutex.
While A releases the mutex and calls mem_cgroup_out_of_memory others
will wake up (one after another) and increase the counter and fall into
sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it
takes the mutex again and decreases oom_lock and wakes other tasks (if
releasing memory of the killed task hasn't done it yet).
The main problem here is that everybody still race for the mutex and
there is no guarantee that we will get counter back to 0 for those
that got back to mem_cgroup_handle_oom. In the end the whole convoy
in/decreases the counter but we do not get to 1 that would enable
killing so nothing useful is going on.
The time is basically unbounded because it highly depends on scheduling
and ordering on mutex.

This patch replaces the counter by a simple {un}lock semantic. We are
using only 0 and 1 to distinguish those two states.
As mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have
to make sure that nobody else races with us which is guaranteed by the
memcg_oom_mutex. All other consumers just read the value atomically for
a single group which is sufficient because we set the value atomically.
mem_cgroup_oom_lock has to be really careful because we might be
higher in a hierarchy than an already oom locked subtree of the same
hierarchy:
          A
        /   \
       B     \
      /\      \
     C  D     E

B - C - D tree might be already locked. While we want to enable locking E
subtree because OOM situations cannot influence each other we definitely
do not want to allow locking A.
Therefore we have to refuse the lock if any subtree is already locked and
clear the lock for all nodes that have been set up to the failure
point.
On the other hand we have to make sure the rest of the world will
recognize that the group is under OOM even though it doesn't have a
lock. Therefore we have to introduce an under_oom variable which is
incremented and decremented for the whole subtree when we enter resp. leave
mem_cgroup_handle_oom.

Unlock path is then very easy because we always unlock only that subtree
we have locked previously.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   81 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 65 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..a278188 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -246,7 +246,8 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	atomic_t	oom_lock;
+	bool		oom_lock;
+	atomic_t	under_oom;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -1803,37 +1804,82 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
+ * Has to be called with memcg_oom_mutex
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
-	int x, lock_count = 0;
-	struct mem_cgroup *iter;
+	int lock_count = -1;
+	struct mem_cgroup *iter, *failed = NULL;
+	bool cond = true;
 
-	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		bool locked = iter->oom_lock;
+		iter->oom_lock = true;
+		if (lock_count == -1)
+			lock_count = iter->oom_lock;
+		else if (lock_count != locked) {
+			/*
+			 * this subtree of our hierarchy is already locked
+			 * so we cannot give a lock.
+			 */
+			lock_count = 0;
+			failed = iter;
+			cond = false;
+		}
 	}
 
-	if (lock_count == 1)
-		return true;
-	return false;
+	if (!failed)
+		goto done;
+
+	/*
+	 * OK, we failed to lock the whole subtree so we have to clean up
+	 * what we set up to the failing subtree
+	 */
+	cond = true;
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		if (iter == failed) {
+			cond = false;
+			continue;
+		}
+		iter->oom_lock = false;
+	}
+done:
+	return lock_count;
 }
 
+/*
+ * Has to be called with memcg_oom_mutex
+ */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
 	struct mem_cgroup *iter;
 
+	for_each_mem_cgroup_tree(iter, mem)
+		iter->oom_lock = false;
+	return 0;
+}
+
+static void mem_cgroup_mark_under_oom(struct mem_cgroup *mem)
+{
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, mem)
+		atomic_inc(&iter->under_oom);
+}
+
+static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
+{
+	struct mem_cgroup *iter;
+
 	/*
 	 * When a new child is created while the hierarchy is under oom,
 	 * mem_cgroup_oom_lock() may not be called. We have to use
 	 * atomic_add_unless() here.
 	 */
 	for_each_mem_cgroup_tree(iter, mem)
-		atomic_add_unless(&iter->oom_lock, -1, 0);
-	return 0;
+		atomic_add_unless(&iter->under_oom, -1, 0);
 }
 
-
 static DEFINE_MUTEX(memcg_oom_mutex);
 static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
 
@@ -1875,7 +1921,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
 
 static void memcg_oom_recover(struct mem_cgroup *mem)
 {
-	if (mem && atomic_read(&mem->oom_lock))
+	if (mem && atomic_read(&mem->under_oom))
 		memcg_wakeup_oom(mem);
 }
 
@@ -1896,6 +1942,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	/* At first, try to OOM lock hierarchy under mem.*/
 	mutex_lock(&memcg_oom_mutex);
 	locked = mem_cgroup_oom_lock(mem);
+	mem_cgroup_mark_under_oom(mem);
 	/*
 	 * Even if signal_pending(), we can't quit charge() loop without
 	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
@@ -1916,7 +1963,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
+	mem_cgroup_unmark_under_oom(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4584,7 +4633,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	list_add(&event->list, &memcg->oom_notify);
 
 	/* already in OOM ? */
-	if (atomic_read(&memcg->oom_lock))
+	if (atomic_read(&memcg->under_oom))
 		eventfd_signal(eventfd, 1);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4619,7 +4668,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 
 	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
 
-	if (atomic_read(&mem->oom_lock))
+	if (atomic_read(&mem->under_oom))
 		cb->fill(cb, "under_oom", 1);
 	else
 		cb->fill(cb, "under_oom", 0);
-- 
1.7.5.4


-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
@ 2011-07-14 12:55                       ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-14 12:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu 14-07-11 20:50:12, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 13:30:09 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
[...]
> >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> >  {
> > -	int x, lock_count = 0;
> > -	struct mem_cgroup *iter;
> > +	int x, lock_count = -1;
> > +	struct mem_cgroup *iter, *failed = NULL;
> > +	bool cond = true;
> >  
> > -	for_each_mem_cgroup_tree(iter, mem) {
> > -		x = atomic_inc_return(&iter->oom_lock);
> > -		lock_count = max(x, lock_count);
> > +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > +		if (lock_count == -1)
> > +			lock_count = x;
> > +		else if (lock_count != x) {
> > +			/*
> > +			 * this subtree of our hierarchy is already locked
> > +			 * so we cannot give a lock.
> > +			 */
> > +			lock_count = 0;
> > +			failed = iter;
> > +			cond = false;
> > +		}
> >  	}
> 
> Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
> And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
> Before lock
>   A  0
>   B  1
>   C  1
>   D  1
>   E  0
> 
> After lock
>   A  1
>   B  1
>   C  1
>   D  1
>   E  0
> 
> here, failed = B, cond = false. Undo routine will unlock A.
> Hmm, seems to work in this case.
> 
> But....A's oom_lock==0 and memcg_oom_wakeup() at el will not able to
> know "A" is in OOM. wakeup processes in A which is waiting for oom recover..

Hohm, we need to have 2 different states: lock and mark_oom.
memcg_oom_recover would check only under_oom.

> 
> Will this work ?

No it won't because the rest of the world has no idea that A is
under_oom as well.

> ==
>  # cgcreate -g memory:A
>  # cgset -r memory.use_hierarchy=1 A
>  # cgset -r memory.oom_control=1   A
>  # cgset -r memory.limit_in_bytes= 100M
>  # cgset -r memory.memsw.limit_in_bytes= 100M
>  # cgcreate -g memory:A/B
>  # cgset -r memory.oom_control=1 A/B
>  # cgset -r memory.limit_in_bytes=20M
>  # cgset -r memory.memsw.limit_in_bytes=20M
> 
>  Assume malloc XXX is a program allocating XXX Megabytes of memory.
> 
>  # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
>  # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
> 
> 
> Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.
> 
>  # cgset -r memory.memsw.limit_in_bytes=300M A
>  # cgset -r memory.limit_in_bytes=300M A
> 
>  malloc 80 will end.

What about yet another approach? Very similar to what you proposed, I
guess. Again not tested and it needs some cleanup, just to illustrate.
What do you think?
--- 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14 12:55                       ` Michal Hocko
@ 2011-07-14 23:47                         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 23:47 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu, 14 Jul 2011 14:55:55 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 14-07-11 20:50:12, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 13:30:09 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > >  {
> > > -	int x, lock_count = 0;
> > > -	struct mem_cgroup *iter;
> > > +	int x, lock_count = -1;
> > > +	struct mem_cgroup *iter, *failed = NULL;
> > > +	bool cond = true;
> > >  
> > > -	for_each_mem_cgroup_tree(iter, mem) {
> > > -		x = atomic_inc_return(&iter->oom_lock);
> > > -		lock_count = max(x, lock_count);
> > > +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > +		if (lock_count == -1)
> > > +			lock_count = x;
> > > +		else if (lock_count != x) {
> > > +			/*
> > > +			 * this subtree of our hierarchy is already locked
> > > +			 * so we cannot give a lock.
> > > +			 */
> > > +			lock_count = 0;
> > > +			failed = iter;
> > > +			cond = false;
> > > +		}
> > >  	}
> > 
> > Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
> > And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
> > Before lock
> >   A  0
> >   B  1
> >   C  1
> >   D  1
> >   E  0
> > 
> > After lock
> >   A  1
> >   B  1
> >   C  1
> >   D  1
> >   E  0
> > 
> > here, failed = B, cond = false. Undo routine will unlock A.
> > Hmm, seems to work in this case.
> > 
> > But....A's oom_lock==0 and memcg_oom_wakeup() at el will not able to
> > know "A" is in OOM. wakeup processes in A which is waiting for oom recover..
> 
> Hohm, we need to have 2 different states. lock and mark_oom.
> oom_recovert would check only the under_oom.
> 

yes. I think so, too.

> > 
> > Will this work ?
> 
> No it won't because the rest of the world has no idea that A is
> under_oom as well.
> 
> > ==
> >  # cgcreate -g memory:A
> >  # cgset -r memory.use_hierarchy=1 A
> >  # cgset -r memory.oom_control=1   A
> >  # cgset -r memory.limit_in_bytes= 100M
> >  # cgset -r memory.memsw.limit_in_bytes= 100M
> >  # cgcreate -g memory:A/B
> >  # cgset -r memory.oom_control=1 A/B
> >  # cgset -r memory.limit_in_bytes=20M
> >  # cgset -r memory.memsw.limit_in_bytes=20M
> > 
> >  Assume malloc XXX is a program allocating XXX Megabytes of memory.
> > 
> >  # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
> >  # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
> > 
> > 
> > Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.
> > 
> >  # cgset -r memory.memsw.limit_in_bytes=300M A
> >  # cgset -r memory.limit_in_bytes=300M A
> > 
> >  malloc 80 will end.
> 
> What about yet another approach? Very similar what you proposed, I
> guess. Again not tested and needs some cleanup just to illustrate.
> What do you think?

Hmm, I think this will work. Please go ahead.
Unfortunately, I'll not be able to make a quick response for a week
because of other tasks. I'm sorry.

Anyway,
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> 

BTW, it's better to add "How-to-test" and the result in the description.
A test similar to mine will show the result we want.
==
Make a hierarchy of memcg, which has 300MB memory+swap limit.

 %cgcreate -g memory:A
 %cgset -r memory.limit_in_bytes=300M A
 %cgset -r memory.memsw.limit_in_bytes=300M A

Then, run the following program under A.
 %cgexec -g memory:A ./fork
==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
        int i;
        int status;

        for (i = 0; i < 5000; i++) {
                if (fork() == 0) {
                        char *c;
                        c = malloc(1024*1024);
                        memset(c, 0, 1024*1024);
                        sleep(20);
                        fprintf(stderr, "[%d]\n", i);
                        exit(0);
                }
                printf("%d\n", i);
                waitpid(-1, &status, WNOHANG);
        }
        while (1) {
                int ret;
                ret = waitpid(-1, &status, WNOHANG);

                if (ret == -1)
                        break;
                if (!ret)
                        sleep(1);
        }
        return 0;
}
==

Thank you for your effort.
-Kame


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
@ 2011-07-14 23:47                         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 23:47 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Thu, 14 Jul 2011 14:55:55 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 14-07-11 20:50:12, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 13:30:09 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > >  {
> > > -	int x, lock_count = 0;
> > > -	struct mem_cgroup *iter;
> > > +	int x, lock_count = -1;
> > > +	struct mem_cgroup *iter, *failed = NULL;
> > > +	bool cond = true;
> > >  
> > > -	for_each_mem_cgroup_tree(iter, mem) {
> > > -		x = atomic_inc_return(&iter->oom_lock);
> > > -		lock_count = max(x, lock_count);
> > > +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > +		if (lock_count == -1)
> > > +			lock_count = x;
> > > +		else if (lock_count != x) {
> > > +			/*
> > > +			 * this subtree of our hierarchy is already locked
> > > +			 * so we cannot give a lock.
> > > +			 */
> > > +			lock_count = 0;
> > > +			failed = iter;
> > > +			cond = false;
> > > +		}
> > >  	}
> > 
> > Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
> > And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
> > Before lock
> >   A  0
> >   B  1
> >   C  1
> >   D  1
> >   E  0
> > 
> > After lock
> >   A  1
> >   B  1
> >   C  1
> >   D  1
> >   E  0
> > 
> > here, failed = B, cond = false. Undo routine will unlock A.
> > Hmm, seems to work in this case.
> > 
> > But....A's oom_lock==0 and memcg_oom_wakeup() at el will not able to
> > know "A" is in OOM. wakeup processes in A which is waiting for oom recover..
> 
> Hohm, we need to have 2 different states. lock and mark_oom.
> oom_recovert would check only the under_oom.
> 

yes. I think so, too.

> > 
> > Will this work ?
> 
> No it won't because the rest of the world has no idea that A is
> under_oom as well.
> 
> > ==
> >  # cgcreate -g memory:A
> >  # cgset -r memory.use_hierarchy=1 A
> >  # cgset -r memory.oom_control=1   A
> >  # cgset -r memory.limit_in_bytes= 100M
> >  # cgset -r memory.memsw.limit_in_bytes= 100M
> >  # cgcreate -g memory:A/B
> >  # cgset -r memory.oom_control=1 A/B
> >  # cgset -r memory.limit_in_bytes=20M
> >  # cgset -r memory.memsw.limit_in_bytes=20M
> > 
> >  Assume malloc XXX is a program allocating XXX Megabytes of memory.
> > 
> >  # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
> >  # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
> > 
> > 
> > Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.
> > 
> >  # cgset -r memory.memsw.limit_in_bytes=300M A
> >  # cgset -r memory.limit_in_bytes=300M A
> > 
> >  malloc 80 will end.
> 
> What about yet another approach? Very similar what you proposed, I
> guess. Again not tested and needs some cleanup just to illustrate.
> What do you think?

Hmm, I think this will work. Please go ahead.
Unfortunately, I'll not be able to make a quick response for a week
because of other tasks. I'm sorry.

Anyway,
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> 

BTW, it's better to add "How-to-test" and the result in the description.
A test similar to mine will show the result we want.
==
Make a hierarchy of memcg, which has 300MB memory+swap limit.

 %cgcreate -g memory:A
 %cgset -r memory.limit_in_bytes=300M A
 %cgset -r memory.memsw.limit_in_bytes=300M A

Then, run the following program under A.
 %cgexec -g memory:A ./fork
==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
        int i;
        int status;

        for (i = 0; i < 5000; i++) {
                if (fork() == 0) {
                        char *c;
                        c = malloc(1024*1024);
                        memset(c, 0, 1024*1024);
                        sleep(20);
                        fprintf(stderr, "[%d]\n", i);
                        exit(0);
                }
                printf("%d\n", i);
                waitpid(-1, &status, WNOHANG);
        }
        while (1) {
                int ret;
                ret = waitpid(-1, &status, WNOHANG);

                if (ret == -1)
                        break;
                if (!ret)
                        sleep(1);
        }
        return 0;
}
==

Thank you for your effort.
-Kame


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner
  2011-07-14 23:47                         ` KAMEZAWA Hiroyuki
@ 2011-07-15  7:28                           ` Michal Hocko
  -1 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2011-07-15  7:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Daisuke Nishimura, linux-kernel

On Fri 15-07-11 08:47:55, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 14:55:55 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Thu 14-07-11 20:50:12, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 14 Jul 2011 13:30:09 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > [...]
> > > >  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
> > > >  {
> > > > -	int x, lock_count = 0;
> > > > -	struct mem_cgroup *iter;
> > > > +	int x, lock_count = -1;
> > > > +	struct mem_cgroup *iter, *failed = NULL;
> > > > +	bool cond = true;
> > > >  
> > > > -	for_each_mem_cgroup_tree(iter, mem) {
> > > > -		x = atomic_inc_return(&iter->oom_lock);
> > > > -		lock_count = max(x, lock_count);
> > > > +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> > > > +		x = !!atomic_add_unless(&iter->oom_lock, 1, 1);
> > > > +		if (lock_count == -1)
> > > > +			lock_count = x;
> > > > +		else if (lock_count != x) {
> > > > +			/*
> > > > +			 * this subtree of our hierarchy is already locked
> > > > +			 * so we cannot give a lock.
> > > > +			 */
> > > > +			lock_count = 0;
> > > > +			failed = iter;
> > > > +			cond = false;
> > > > +		}
> > > >  	}
> > > 
> > > Hm ? assuming B-C-D is locked and a new thread tries a lock on A-B-C-D-E.
> > > And for_each_mem_cgroup_tree will find groups in order of A->B->C->D->E.
> > > Before lock
> > >   A  0
> > >   B  1
> > >   C  1
> > >   D  1
> > >   E  0
> > > 
> > > After lock
> > >   A  1
> > >   B  1
> > >   C  1
> > >   D  1
> > >   E  0
> > > 
> > > here, failed = B, cond = false. Undo routine will unlock A.
> > > Hmm, seems to work in this case.
> > > 
> > > But....A's oom_lock==0 and memcg_oom_wakeup() at el will not able to
> > > know "A" is in OOM. wakeup processes in A which is waiting for oom recover..
> > 
> > Hohm, we need to have 2 different states. lock and mark_oom.
> > oom_recovert would check only the under_oom.
> > 
> 
> yes. I think so, too.
> 
> > > 
> > > Will this work ?
> > 
> > No it won't because the rest of the world has no idea that A is
> > under_oom as well.
> > 
> > > ==
> > >  # cgcreate -g memory:A
> > >  # cgset -r memory.use_hierarchy=1 A
> > >  # cgset -r memory.oom_control=1   A
> > >  # cgset -r memory.limit_in_bytes= 100M
> > >  # cgset -r memory.memsw.limit_in_bytes= 100M
> > >  # cgcreate -g memory:A/B
> > >  # cgset -r memory.oom_control=1 A/B
> > >  # cgset -r memory.limit_in_bytes=20M
> > >  # cgset -r memory.memsw.limit_in_bytes=20M
> > > 
> > >  Assume malloc XXX is a program allocating XXX Megabytes of memory.
> > > 
> > >  # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
> > >  # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
> > > 
> > > 
> > > Here, 2 procs are blocked by OOM. Here, relax A's limitation and clear OOM.
> > > 
> > >  # cgset -r memory.memsw.limit_in_bytes=300M A
> > >  # cgset -r memory.limit_in_bytes=300M A
> > > 
> > >  malloc 80 will end.
> > 
> > What about yet another approach? Very similar what you proposed, I
> > guess. Again not tested and needs some cleanup just to illustrate.
> > What do you think?
> 
> Hmm, I think this will work. Please go ahead.
> Unfortunately, I'll not be able to make a quick response for a week
> because of other tasks. I'm sorry.
> 
> Anyway,
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> 

Thanks

> 
> BTW, it's better to add "How-to-test" and the result in description.
> Some test similar to mine will show the result we want.

Agreed. I will hammer it today and repost with a cleaned up description
including your example as well as mine that triggered the convoy behavior.

I hope you don't mind if I add your s-o-b to the patch as this was
cooperative work.

> ==
> Make a hierarchy of memcg, which has 300MB memory+swap limit.
> 
>  %cgcreate -g memory:A
>  %cgset -r memory.limit_in_bytes=300M A
>  %cgset -r memory.memsw.limit_in_bytes=300M A
> 
> Then, run the following program under A.
>  %cgexec -g memory:A ./fork
> ==
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/wait.h>
> 
> int main(int argc, char *argv[])
> {
>         int i;
>         int status;
> 
>         /* fork 5000 children; each one touches 1MB and sleeps for a while */
>         for (i = 0; i < 5000; i++) {
>                 if (fork() == 0) {
>                         char *c;
>                         c = malloc(1024*1024);
>                         if (c)
>                                 memset(c, 0, 1024*1024);
>                         sleep(20);
>                         fprintf(stderr, "[%d]\n", i);
>                         exit(0);
>                 }
>                 printf("%d\n", i);
>                 waitpid(-1, &status, WNOHANG);
>         }
>         while (1) {
>                 int ret;
>                 ret = waitpid(-1, &status, WNOHANG);
> 
>                 if (ret == -1)
>                         break;
>                 if (!ret)
>                         sleep(1);
>         }
>         return 0;
> }
> ==
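
(Assuming the program above is saved as fork.c, "gcc -o fork fork.c" plus the
cgexec line is all that is needed; the point is that many children touch
memory at the same time, so the group keeps hitting its limit and many tasks
enter the OOM path together.)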
> 
> Thank you for your effort.
> -Kame
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 27+ messages in thread


end of thread, other threads:[~2011-07-15  7:28 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-13 12:44 [PATCH 0/2] memcg: oom locking updates Michal Hocko
2011-07-13 11:05 ` [PATCH 1/2] memcg: make oom_lock 0 and 1 based rather than coutner Michal Hocko
2011-07-14  1:02   ` KAMEZAWA Hiroyuki
2011-07-14  1:02     ` KAMEZAWA Hiroyuki
2011-07-14  2:59     ` KAMEZAWA Hiroyuki
2011-07-14  2:59       ` KAMEZAWA Hiroyuki
2011-07-14  9:00       ` Michal Hocko
2011-07-14  9:00         ` Michal Hocko
2011-07-14  9:30         ` KAMEZAWA Hiroyuki
2011-07-14  9:30           ` KAMEZAWA Hiroyuki
2011-07-14  9:51           ` Michal Hocko
2011-07-14  9:51             ` Michal Hocko
2011-07-14 10:17             ` KAMEZAWA Hiroyuki
2011-07-14 10:17               ` KAMEZAWA Hiroyuki
2011-07-14 11:09               ` Michal Hocko
2011-07-14 11:09                 ` Michal Hocko
2011-07-14 11:30                 ` Michal Hocko
2011-07-14 11:30                   ` Michal Hocko
2011-07-14 11:50                   ` KAMEZAWA Hiroyuki
2011-07-14 11:50                     ` KAMEZAWA Hiroyuki
2011-07-14 12:55                     ` Michal Hocko
2011-07-14 12:55                       ` Michal Hocko
2011-07-14 23:47                       ` KAMEZAWA Hiroyuki
2011-07-14 23:47                         ` KAMEZAWA Hiroyuki
2011-07-15  7:28                         ` Michal Hocko
2011-07-15  7:28                           ` Michal Hocko
2011-07-13 12:32 ` [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.