* [patch 1/3] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h
@ 2012-06-26  1:47 ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-26  1:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

mem_cgroup_out_of_memory() is defined in mm/oom_kill.c, so declare it in
linux/oom.h rather than linux/memcontrol.h.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/memcontrol.h |    2 --
 include/linux/oom.h        |    2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -72,8 +72,6 @@ extern void mem_cgroup_uncharge_end(void);
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
 
-extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				     int order);
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg);
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,6 +49,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
+extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+				     int order);
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);

* [rfc][patch 2/3] mm, oom: introduce helper function to process threads during scan
@ 2012-06-26  1:47   ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-26  1:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

This patch introduces a helper function to process each thread during the
iteration over the tasklist.  A new return type, enum oom_scan_t, is
defined to determine how the iteration should proceed:

 - OOM_SCAN_OK: compute this thread's badness and continue the scan,

 - OOM_SCAN_CONTINUE: do not consider this thread for oom kill; it is
   ineligible,

 - OOM_SCAN_ABORT: abort the iteration and return, or

 - OOM_SCAN_SELECT: always select this thread with the highest badness
   possible.

There is no functional change with this patch.  This new helper function
will be used in the next patch in the memory controller.
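
Condensed from the select_bad_process() hunk below, each return value
maps onto a loop action (a sketch, not the verbatim hunk):

	do_each_thread(g, p) {
		unsigned int points;

		switch (oom_scan_process_thread(p, memcg, totalpages,
						nodemask, force_kill)) {
		case OOM_SCAN_SELECT:	/* force p as the chosen victim */
			chosen = p;
			chosen_points = ULONG_MAX;
			/* fall through */
		case OOM_SCAN_CONTINUE:	/* p is ineligible, skip it */
			continue;
		case OOM_SCAN_ABORT:	/* a kill is already underway */
			return ERR_PTR(-1UL);
		case OOM_SCAN_OK:	/* score p normally */
			break;
		}
		points = oom_badness(p, memcg, nodemask, totalpages);
		/* ... remember the highest-scoring candidate ... */
	} while_each_thread(g, p);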

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |  111 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 65 insertions(+), 46 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,6 +288,59 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
+enum oom_scan_t {
+	OOM_SCAN_OK,
+	OOM_SCAN_CONTINUE,
+	OOM_SCAN_ABORT,
+	OOM_SCAN_SELECT,
+};
+
+static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
+		struct mem_cgroup *memcg, unsigned long totalpages,
+		const nodemask_t *nodemask, bool force_kill)
+{
+	if (task->exit_state)
+		return OOM_SCAN_CONTINUE;
+	if (oom_unkillable_task(task, memcg, nodemask))
+		return OOM_SCAN_CONTINUE;
+
+	/*
+	 * This task already has access to memory reserves and is being killed.
+	 * Don't allow any other task to have access to the reserves.
+	 */
+	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+		if (unlikely(frozen(task)))
+			__thaw_task(task);
+		if (!force_kill)
+			return OOM_SCAN_ABORT;
+	}
+	if (!task->mm)
+		return OOM_SCAN_CONTINUE;
+
+	if (task->flags & PF_EXITING) {
+		/*
+		 * If task is current and is in the process of releasing memory,
+		 * allow the "kill" to set TIF_MEMDIE, which will allow it to
+		 * access memory reserves.  Otherwise, it may stall forever.
+		 *
+		 * The iteration isn't broken here, however, in case other
+		 * threads are found to have already been oom killed.
+		 */
+		if (task == current)
+			return OOM_SCAN_SELECT;
+		else if (!force_kill) {
+			/*
+			 * If this task is not being ptraced on exit, then wait
+			 * for it to finish before killing some other task
+			 * unnecessarily.
+			 */
+			if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
+				return OOM_SCAN_ABORT;
+		}
+	}
+	return OOM_SCAN_OK;
+}
+
 /*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
@@ -305,53 +358,19 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	do_each_thread(g, p) {
 		unsigned int points;
 
-		if (p->exit_state)
-			continue;
-		if (oom_unkillable_task(p, memcg, nodemask))
-			continue;
-
-		/*
-		 * This task already has access to memory reserves and is
-		 * being killed. Don't allow any other task access to the
-		 * memory reserve.
-		 *
-		 * Note: this may have a chance of deadlock if it gets
-		 * blocked waiting for another task which itself is waiting
-		 * for memory. Is there a better alternative?
-		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
-			if (unlikely(frozen(p)))
-				__thaw_task(p);
-			if (!force_kill)
-				return ERR_PTR(-1UL);
-		}
-		if (!p->mm)
+		switch (oom_scan_process_thread(p, memcg, totalpages, nodemask,
+						force_kill)) {
+		case OOM_SCAN_SELECT:
+			chosen = p;
+			chosen_points = ULONG_MAX;
+			/* fall through */
+		case OOM_SCAN_CONTINUE:
 			continue;
-
-		if (p->flags & PF_EXITING) {
-			/*
-			 * If p is the current task and is in the process of
-			 * releasing memory, we allow the "kill" to set
-			 * TIF_MEMDIE, which will allow it to gain access to
-			 * memory reserves.  Otherwise, it may stall forever.
-			 *
-			 * The loop isn't broken here, however, in case other
-			 * threads are found to have already been oom killed.
-			 */
-			if (p == current) {
-				chosen = p;
-				chosen_points = ULONG_MAX;
-			} else if (!force_kill) {
-				/*
-				 * If this task is not being ptraced on exit,
-				 * then wait for it to finish before killing
-				 * some other task unnecessarily.
-				 */
-				if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
-					return ERR_PTR(-1UL);
-			}
-		}
-
+		case OOM_SCAN_ABORT:
+			return ERR_PTR(-1UL);
+		case OOM_SCAN_OK:
+			break;
+		};
 		points = oom_badness(p, memcg, nodemask, totalpages);
 		if (points > chosen_points) {
 			chosen = p;

* [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-26  1:47   ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-26  1:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

The global oom killer is serialized by the zonelist being used in the
page allocation.  Concurrent oom kills are thus a rare event and only
occur in systems using mempolicies and with a large number of nodes.
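
That serialization is the try_set_zonelist_oom()/clear_zonelist_oom()
trylock pair; roughly, the global path brackets the kill as in this
sketch (caller shape recalled from mm/page_alloc.c of this era, shown
only for context):

	if (!try_set_zonelist_oom(zonelist, gfp_mask)) {
		/* another oom kill on this zonelist is in progress */
		schedule_timeout_uninterruptible(1);
		return NULL;
	}
	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
	clear_zonelist_oom(zonelist, gfp_mask);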

Memory controller oom kills, however, can frequently be concurrent since
there is no serialization once the oom killer is called for oom
conditions in several different memcgs in parallel.

This creates massive contention on tasklist_lock since the oom killer
takes the read side for the tasklist iteration.  If several memcgs are
calling the oom killer, this lock can be held for a substantial amount of
time, especially if readers continue to enter as other threads are
exiting.

Since the exit path grabs the writeside of the lock with irqs disabled in
a few different places, this can cause a soft lockup on cpus as a result
of tasklist_lock starvation.
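
The writeside usage in question is the irqs-off write lock taken while
detaching an exiting task, e.g. in release_task() and exit_notify()
(shape only, assuming the kernel/exit.c usage of this era):

	write_lock_irq(&tasklist_lock);
	/* reparent children, unhash the exiting task, ... */
	write_unlock_irq(&tasklist_lock);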

The kernel lacks unfair writelocks, and successful calls to the oom
killer usually result in at least one thread entering the exit path, so
an alternative solution is needed.

This patch introduces a separate oom handler for memcgs so that they do
not hold tasklist_lock for as long.  Instead, it iterates only over the
threads attached to the oom memcg and grabs a reference to the selected
thread before calling oom_kill_process() to ensure it doesn't exit
prematurely.
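
Condensed from __mem_cgroup_out_of_memory() in the hunk below, the
iteration and pinning skeleton looks like this sketch (badness
bookkeeping and error checks elided):

	for_each_mem_cgroup_tree(iter, memcg) {
		struct cgroup *cgroup = iter->css.cgroup;
		struct cgroup_iter it;

		cgroup_iter_start(cgroup, &it);
		while ((task = cgroup_iter_next(cgroup, &it))) {
			/* on finding a new best candidate: */
			if (chosen)
				put_task_struct(chosen);
			chosen = task;
			get_task_struct(chosen);	/* pin it past the iterator */
		}
		cgroup_iter_end(cgroup, &it);
		if (!memcg->use_hierarchy)
			break;
	}

	read_lock(&tasklist_lock);	/* only for the kill itself */
	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
			 NULL, "Memory cgroup out of memory");
	read_unlock(&tasklist_lock);
	put_task_struct(chosen);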

This still requires tasklist_lock for the tasklist dump, iterating
children of the selected process, and killing all other threads on the
system sharing the same memory as the selected victim.  So while this
isn't a complete solution to tasklist_lock starvation, it significantly
reduces the amount of time that it is held.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/memcontrol.h |    9 ++-----
 include/linux/oom.h        |   16 ++++++++++++
 mm/memcontrol.c            |   62 +++++++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c              |   48 +++++++++++-----------------------
 4 files changed, 94 insertions(+), 41 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -180,7 +180,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg);
+extern void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+				       int order);
 
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -364,12 +365,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
-static inline
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
-{
-	return 0;
-}
-
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
 {
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -40,17 +40,33 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+enum oom_scan_t {
+	OOM_SCAN_OK,
+	OOM_SCAN_CONTINUE,
+	OOM_SCAN_ABORT,
+	OOM_SCAN_SELECT,
+};
+
 extern void compare_swap_oom_score_adj(int old_val, int new_val);
 extern int test_set_oom_score_adj(int new_val);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
+extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+			     unsigned int points, unsigned long totalpages,
+			     struct mem_cgroup *memcg, nodemask_t *nodemask,
+			     const char *message);
+
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
+extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
+		unsigned long totalpages, const nodemask_t *nodemask,
+		bool force_kill);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order);
+
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1454,7 +1454,7 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
 /*
  * Return the memory (and swap, if configured) limit for a memcg.
  */
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 {
 	u64 limit;
 	u64 memsw;
@@ -1470,6 +1470,66 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return min(limit, memsw);
 }
 
+void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+				int order)
+{
+	struct mem_cgroup *iter;
+	unsigned long chosen_points = 0;
+	unsigned long totalpages;
+	unsigned int points = 0;
+	struct task_struct *chosen = NULL;
+	struct task_struct *task;
+
+	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+	for_each_mem_cgroup_tree(iter, memcg) {
+		struct cgroup *cgroup = iter->css.cgroup;
+		struct cgroup_iter it;
+
+		cgroup_iter_start(cgroup, &it);
+		while ((task = cgroup_iter_next(cgroup, &it))) {
+			switch (oom_scan_process_thread(task, totalpages, NULL,
+							false)) {
+			case OOM_SCAN_SELECT:
+				if (chosen)
+					put_task_struct(chosen);
+				chosen = task;
+				chosen_points = ULONG_MAX;
+				get_task_struct(chosen);
+				/* fall through */
+			case OOM_SCAN_CONTINUE:
+				continue;
+			case OOM_SCAN_ABORT:
+				cgroup_iter_end(cgroup, &it);
+				if (chosen)
+					put_task_struct(chosen);
+				return;
+			case OOM_SCAN_OK:
+				break;
+			};
+			points = oom_badness(task, memcg, NULL, totalpages);
+			if (points > chosen_points) {
+				if (chosen)
+					put_task_struct(chosen);
+				chosen = task;
+				chosen_points = points;
+				get_task_struct(chosen);
+			}
+		}
+		cgroup_iter_end(cgroup, &it);
+		if (!memcg->use_hierarchy)
+			break;
+	}
+
+	if (!chosen)
+		return;
+	points = chosen_points * 1000 / totalpages;
+	read_lock(&tasklist_lock);
+	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
+			 NULL, "Memory cgroup out of memory");
+	read_unlock(&tasklist_lock);
+	put_task_struct(chosen);
+}
+
 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 					gfp_t gfp_mask,
 					unsigned long flags)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,20 +288,13 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
-enum oom_scan_t {
-	OOM_SCAN_OK,
-	OOM_SCAN_CONTINUE,
-	OOM_SCAN_ABORT,
-	OOM_SCAN_SELECT,
-};
-
-static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
-		struct mem_cgroup *memcg, unsigned long totalpages,
-		const nodemask_t *nodemask, bool force_kill)
+enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
+		unsigned long totalpages, const nodemask_t *nodemask,
+		bool force_kill)
 {
 	if (task->exit_state)
 		return OOM_SCAN_CONTINUE;
-	if (oom_unkillable_task(task, memcg, nodemask))
+	if (oom_unkillable_task(task, NULL, nodemask))
 		return OOM_SCAN_CONTINUE;
 
 	/*
@@ -348,8 +341,8 @@ static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned int *ppoints,
-		unsigned long totalpages, struct mem_cgroup *memcg,
-		const nodemask_t *nodemask, bool force_kill)
+		unsigned long totalpages, const nodemask_t *nodemask,
+		bool force_kill)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -358,7 +351,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	do_each_thread(g, p) {
 		unsigned int points;
 
-		switch (oom_scan_process_thread(p, memcg, totalpages, nodemask,
+		switch (oom_scan_process_thread(p, totalpages, nodemask,
 						force_kill)) {
 		case OOM_SCAN_SELECT:
 			chosen = p;
@@ -371,7 +364,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		case OOM_SCAN_OK:
 			break;
 		};
-		points = oom_badness(p, memcg, nodemask, totalpages);
+		points = oom_badness(p, NULL, nodemask, totalpages);
 		if (points > chosen_points) {
 			chosen = p;
 			chosen_points = points;
@@ -442,10 +435,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
-static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			     unsigned int points, unsigned long totalpages,
-			     struct mem_cgroup *memcg, nodemask_t *nodemask,
-			     const char *message)
+void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+		      unsigned int points, unsigned long totalpages,
+		      struct mem_cgroup *memcg, nodemask_t *nodemask,
+		      const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -563,10 +556,6 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			      int order)
 {
-	unsigned long limit;
-	unsigned int points = 0;
-	struct task_struct *p;
-
 	/*
 	 * If current has a pending SIGKILL, then automatically select it.  The
 	 * goal is to allow it to allocate so that it may quickly exit and free
@@ -578,13 +567,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	}
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
-	limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
-	read_lock(&tasklist_lock);
-	p = select_bad_process(&points, limit, memcg, NULL, false);
-	if (p && PTR_ERR(p) != -1UL)
-		oom_kill_process(p, gfp_mask, order, points, limit, memcg, NULL,
-				 "Memory cgroup out of memory");
-	read_unlock(&tasklist_lock);
+	__mem_cgroup_out_of_memory(memcg, gfp_mask, order);
 }
 #endif
 
@@ -709,7 +692,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	struct task_struct *p;
 	unsigned long totalpages;
 	unsigned long freed = 0;
-	unsigned int points;
+	unsigned int uninitialized_var(points);
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;
 
@@ -747,8 +730,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		goto out;
 	}
 
-	p = select_bad_process(&points, totalpages, NULL, mpol_mask,
-			       force_kill);
+	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
 		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);

* Re: [patch 1/3] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h
@ 2012-06-26  3:12 ` Kamezawa Hiroyuki
  2012-06-26  6:04     ` KOSAKI Motohiro
  0 siblings, 1 reply; 66+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-26  3:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, linux-mm, cgroups

(2012/06/26 10:47), David Rientjes wrote:
> mem_cgroup_out_of_memory() is defined in mm/oom_kill.c, so declare it in
> linux/oom.h rather than linux/memcontrol.h.
>
> Signed-off-by: David Rientjes <rientjes@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

* Re: [rfc][patch 2/3] mm, oom: introduce helper function to process threads during scan
@ 2012-06-26  3:22     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 66+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-26  3:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, linux-mm, cgroups

(2012/06/26 10:47), David Rientjes wrote:
> This patch introduces a helper function to process each thread during the
> iteration over the tasklist.  A new return type, enum oom_scan_t, is
> defined to determine how the iteration should proceed:
>
>   - OOM_SCAN_OK: compute this thread's badness and continue the scan,
>
>   - OOM_SCAN_CONTINUE: do not consider this thread for oom kill; it is
>     ineligible,
>
>   - OOM_SCAN_ABORT: abort the iteration and return, or
>
>   - OOM_SCAN_SELECT: always select this thread with the highest badness
>     possible.
>
> There is no functional change with this patch.  This new helper function
> will be used in the next patch in the memory controller.
>
> Signed-off-by: David Rientjes <rientjes@google.com>

I like this.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-26  5:32     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 66+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-26  5:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, linux-mm, cgroups

(2012/06/26 10:47), David Rientjes wrote:
> The global oom killer is serialized by the zonelist being used in the
> page allocation.  Concurrent oom kills are thus a rare event and only
> occur in systems using mempolicies and with a large number of nodes.
>
> Memory controller oom kills, however, can frequently be concurrent since
> there is no serialization once the oom killer is called for oom
> conditions in several different memcgs in parallel.
>
> This creates massive contention on tasklist_lock since the oom killer
> takes the read side for the tasklist iteration.  If several memcgs are
> calling the oom killer, this lock can be held for a substantial amount of
> time, especially if readers continue to enter as other threads are
> exiting.
>
> Since the exit path grabs the writeside of the lock with irqs disabled in
> a few different places, this can cause a soft lockup on cpus as a result
> of tasklist_lock starvation.
>
> The kernel lacks unfair writelocks, and successful calls to the oom
> killer usually result in at least one thread entering the exit path, so
> an alternative solution is needed.
>
> This patch introduces a separate oom handler for memcgs so that they do
> not hold tasklist_lock for as long.  Instead, it iterates only over the
> threads attached to the oom memcg and grabs a reference to the selected
> thread before calling oom_kill_process() to ensure it doesn't exit
> prematurely.
>
> This still requires tasklist_lock for the tasklist dump, iterating
> children of the selected process, and killing all other threads on the
> system sharing the same memory as the selected victim.  So while this
> isn't a complete solution to tasklist_lock starvation, it significantly
> reduces the amount of time that it is held.
>
> Signed-off-by: David Rientjes <rientjes@google.com>

This seems good. Thank you!

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


> ---
>   include/linux/memcontrol.h |    9 ++-----
>   include/linux/oom.h        |   16 ++++++++++++
>   mm/memcontrol.c            |   62 +++++++++++++++++++++++++++++++++++++++++++-
>   mm/oom_kill.c              |   48 +++++++++++-----------------------
>   4 files changed, 94 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -180,7 +180,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>   unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>   						gfp_t gfp_mask,
>   						unsigned long *total_scanned);
> -u64 mem_cgroup_get_limit(struct mem_cgroup *memcg);
> +extern void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +				       int order);
>
>   void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -364,12 +365,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>   	return 0;
>   }
>
> -static inline
> -u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> -{
> -	return 0;
> -}
> -
>   static inline void mem_cgroup_split_huge_fixup(struct page *head)
>   {
>   }
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -40,17 +40,33 @@ enum oom_constraint {
>   	CONSTRAINT_MEMCG,
>   };
>
> +enum oom_scan_t {
> +	OOM_SCAN_OK,
> +	OOM_SCAN_CONTINUE,
> +	OOM_SCAN_ABORT,
> +	OOM_SCAN_SELECT,
> +};
> +
>   extern void compare_swap_oom_score_adj(int old_val, int new_val);
>   extern int test_set_oom_score_adj(int new_val);
>
>   extern unsigned long oom_badness(struct task_struct *p,
>   		struct mem_cgroup *memcg, const nodemask_t *nodemask,
>   		unsigned long totalpages);
> +extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> +			     unsigned int points, unsigned long totalpages,
> +			     struct mem_cgroup *memcg, nodemask_t *nodemask,
> +			     const char *message);
> +
>   extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>   extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>
> +extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> +		unsigned long totalpages, const nodemask_t *nodemask,
> +		bool force_kill);
>   extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   				     int order);
> +
>   extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>   		int order, nodemask_t *mask, bool force_kill);
>   extern int register_oom_notifier(struct notifier_block *nb);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1454,7 +1454,7 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
>   /*
>    * Return the memory (and swap, if configured) limit for a memcg.
>    */
> -u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> +static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>   {
>   	u64 limit;
>   	u64 memsw;
> @@ -1470,6 +1470,66 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>   	return min(limit, memsw);
>   }
>
> +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +				int order)
> +{
> +	struct mem_cgroup *iter;
> +	unsigned long chosen_points = 0;
> +	unsigned long totalpages;
> +	unsigned int points = 0;
> +	struct task_struct *chosen = NULL;
> +	struct task_struct *task;
> +
> +	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		struct cgroup *cgroup = iter->css.cgroup;
> +		struct cgroup_iter it;
> +
> +		cgroup_iter_start(cgroup, &it);
> +		while ((task = cgroup_iter_next(cgroup, &it))) {
> +			switch (oom_scan_process_thread(task, totalpages, NULL,
> +							false)) {
> +			case OOM_SCAN_SELECT:
> +				if (chosen)
> +					put_task_struct(chosen);
> +				chosen = task;
> +				chosen_points = ULONG_MAX;
> +				get_task_struct(chosen);
> +				/* fall through */
> +			case OOM_SCAN_CONTINUE:
> +				continue;
> +			case OOM_SCAN_ABORT:
> +				cgroup_iter_end(cgroup, &it);
> +				if (chosen)
> +					put_task_struct(chosen);
> +				return;
> +			case OOM_SCAN_OK:
> +				break;
> +			};
> +			points = oom_badness(task, memcg, NULL, totalpages);
> +			if (points > chosen_points) {
> +				if (chosen)
> +					put_task_struct(chosen);
> +				chosen = task;
> +				chosen_points = points;
> +				get_task_struct(chosen);
> +			}
> +		}
> +		cgroup_iter_end(cgroup, &it);
> +		if (!memcg->use_hierarchy)
> +			break;
> +	}
> +
> +	if (!chosen)
> +		return;
> +	points = chosen_points * 1000 / totalpages;
> +	read_lock(&tasklist_lock);
> +	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
> +			 NULL, "Memory cgroup out of memory");
> +	read_unlock(&tasklist_lock);
> +	put_task_struct(chosen);
> +}
> +
>   static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
>   					gfp_t gfp_mask,
>   					unsigned long flags)
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -288,20 +288,13 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   }
>   #endif
>
> -enum oom_scan_t {
> -	OOM_SCAN_OK,
> -	OOM_SCAN_CONTINUE,
> -	OOM_SCAN_ABORT,
> -	OOM_SCAN_SELECT,
> -};
> -
> -static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> -		struct mem_cgroup *memcg, unsigned long totalpages,
> -		const nodemask_t *nodemask, bool force_kill)
> +enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> +		unsigned long totalpages, const nodemask_t *nodemask,
> +		bool force_kill)
>   {
>   	if (task->exit_state)
>   		return OOM_SCAN_CONTINUE;
> -	if (oom_unkillable_task(task, memcg, nodemask))
> +	if (oom_unkillable_task(task, NULL, nodemask))
>   		return OOM_SCAN_CONTINUE;
>
>   	/*
> @@ -348,8 +341,8 @@ static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>    * (not docbooked, we don't want this one cluttering up the manual)
>    */
>   static struct task_struct *select_bad_process(unsigned int *ppoints,
> -		unsigned long totalpages, struct mem_cgroup *memcg,
> -		const nodemask_t *nodemask, bool force_kill)
> +		unsigned long totalpages, const nodemask_t *nodemask,
> +		bool force_kill)
>   {
>   	struct task_struct *g, *p;
>   	struct task_struct *chosen = NULL;
> @@ -358,7 +351,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>   	do_each_thread(g, p) {
>   		unsigned int points;
>
> -		switch (oom_scan_process_thread(p, memcg, totalpages, nodemask,
> +		switch (oom_scan_process_thread(p, totalpages, nodemask,
>   						force_kill)) {
>   		case OOM_SCAN_SELECT:
>   			chosen = p;
> @@ -371,7 +364,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>   		case OOM_SCAN_OK:
>   			break;
>   		};
> -		points = oom_badness(p, memcg, nodemask, totalpages);
> +		points = oom_badness(p, NULL, nodemask, totalpages);
>   		if (points > chosen_points) {
>   			chosen = p;
>   			chosen_points = points;
> @@ -442,10 +435,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>   }
>
>   #define K(x) ((x) << (PAGE_SHIFT-10))
> -static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> -			     unsigned int points, unsigned long totalpages,
> -			     struct mem_cgroup *memcg, nodemask_t *nodemask,
> -			     const char *message)
> +void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> +		      unsigned int points, unsigned long totalpages,
> +		      struct mem_cgroup *memcg, nodemask_t *nodemask,
> +		      const char *message)
>   {
>   	struct task_struct *victim = p;
>   	struct task_struct *child;
> @@ -563,10 +556,6 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
>   void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   			      int order)
>   {
> -	unsigned long limit;
> -	unsigned int points = 0;
> -	struct task_struct *p;
> -
>   	/*
>   	 * If current has a pending SIGKILL, then automatically select it.  The
>   	 * goal is to allow it to allocate so that it may quickly exit and free
> @@ -578,13 +567,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   	}
>
>   	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
> -	limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> -	read_lock(&tasklist_lock);
> -	p = select_bad_process(&points, limit, memcg, NULL, false);
> -	if (p && PTR_ERR(p) != -1UL)
> -		oom_kill_process(p, gfp_mask, order, points, limit, memcg, NULL,
> -				 "Memory cgroup out of memory");
> -	read_unlock(&tasklist_lock);
> +	__mem_cgroup_out_of_memory(memcg, gfp_mask, order);
>   }
>   #endif
>
> @@ -709,7 +692,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>   	struct task_struct *p;
>   	unsigned long totalpages;
>   	unsigned long freed = 0;
> -	unsigned int points;
> +	unsigned int uninitialized_var(points);
>   	enum oom_constraint constraint = CONSTRAINT_NONE;
>   	int killed = 0;
>
> @@ -747,8 +730,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>   		goto out;
>   	}
>
> -	p = select_bad_process(&points, totalpages, NULL, mpol_mask,
> -			       force_kill);
> +	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
>   	/* Found nothing?!?! Either we hang forever, or we panic. */
>   	if (!p) {
>   		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
>



> +			 NULL, "Memory cgroup out of memory");
> +	read_unlock(&tasklist_lock);
> +	put_task_struct(chosen);
> +}
> +
>   static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
>   					gfp_t gfp_mask,
>   					unsigned long flags)
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -288,20 +288,13 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   }
>   #endif
>
> -enum oom_scan_t {
> -	OOM_SCAN_OK,
> -	OOM_SCAN_CONTINUE,
> -	OOM_SCAN_ABORT,
> -	OOM_SCAN_SELECT,
> -};
> -
> -static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> -		struct mem_cgroup *memcg, unsigned long totalpages,
> -		const nodemask_t *nodemask, bool force_kill)
> +enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> +		unsigned long totalpages, const nodemask_t *nodemask,
> +		bool force_kill)
>   {
>   	if (task->exit_state)
>   		return OOM_SCAN_CONTINUE;
> -	if (oom_unkillable_task(task, memcg, nodemask))
> +	if (oom_unkillable_task(task, NULL, nodemask))
>   		return OOM_SCAN_CONTINUE;
>
>   	/*
> @@ -348,8 +341,8 @@ static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>    * (not docbooked, we don't want this one cluttering up the manual)
>    */
>   static struct task_struct *select_bad_process(unsigned int *ppoints,
> -		unsigned long totalpages, struct mem_cgroup *memcg,
> -		const nodemask_t *nodemask, bool force_kill)
> +		unsigned long totalpages, const nodemask_t *nodemask,
> +		bool force_kill)
>   {
>   	struct task_struct *g, *p;
>   	struct task_struct *chosen = NULL;
> @@ -358,7 +351,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>   	do_each_thread(g, p) {
>   		unsigned int points;
>
> -		switch (oom_scan_process_thread(p, memcg, totalpages, nodemask,
> +		switch (oom_scan_process_thread(p, totalpages, nodemask,
>   						force_kill)) {
>   		case OOM_SCAN_SELECT:
>   			chosen = p;
> @@ -371,7 +364,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>   		case OOM_SCAN_OK:
>   			break;
>   		};
> -		points = oom_badness(p, memcg, nodemask, totalpages);
> +		points = oom_badness(p, NULL, nodemask, totalpages);
>   		if (points > chosen_points) {
>   			chosen = p;
>   			chosen_points = points;
> @@ -442,10 +435,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>   }
>
>   #define K(x) ((x) << (PAGE_SHIFT-10))
> -static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> -			     unsigned int points, unsigned long totalpages,
> -			     struct mem_cgroup *memcg, nodemask_t *nodemask,
> -			     const char *message)
> +void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> +		      unsigned int points, unsigned long totalpages,
> +		      struct mem_cgroup *memcg, nodemask_t *nodemask,
> +		      const char *message)
>   {
>   	struct task_struct *victim = p;
>   	struct task_struct *child;
> @@ -563,10 +556,6 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
>   void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   			      int order)
>   {
> -	unsigned long limit;
> -	unsigned int points = 0;
> -	struct task_struct *p;
> -
>   	/*
>   	 * If current has a pending SIGKILL, then automatically select it.  The
>   	 * goal is to allow it to allocate so that it may quickly exit and free
> @@ -578,13 +567,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   	}
>
>   	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
> -	limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> -	read_lock(&tasklist_lock);
> -	p = select_bad_process(&points, limit, memcg, NULL, false);
> -	if (p && PTR_ERR(p) != -1UL)
> -		oom_kill_process(p, gfp_mask, order, points, limit, memcg, NULL,
> -				 "Memory cgroup out of memory");
> -	read_unlock(&tasklist_lock);
> +	__mem_cgroup_out_of_memory(memcg, gfp_mask, order);
>   }
>   #endif
>
> @@ -709,7 +692,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>   	struct task_struct *p;
>   	unsigned long totalpages;
>   	unsigned long freed = 0;
> -	unsigned int points;
> +	unsigned int uninitialized_var(points);
>   	enum oom_constraint constraint = CONSTRAINT_NONE;
>   	int killed = 0;
>
> @@ -747,8 +730,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>   		goto out;
>   	}
>
> -	p = select_bad_process(&points, totalpages, NULL, mpol_mask,
> -			       force_kill);
> +	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
>   	/* Found nothing?!?! Either we hang forever, or we panic. */
>   	if (!p) {
>   		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
>



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 1/3] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h
@ 2012-06-26  6:04     ` KOSAKI Motohiro
  0 siblings, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2012-06-26  6:04 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: David Rientjes, Andrew Morton, Michal Hocko, Johannes Weiner,
	Minchan Kim, linux-mm, cgroups

On Mon, Jun 25, 2012 at 11:12 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/06/26 10:47), David Rientjes wrote:
>>
>> mem_cgroup_out_of_memory() is defined in mm/oom_kill.c, so declare it in
>> linux/oom.h rather than linux/memcontrol.h.
>>
>> Signed-off-by: David Rientjes <rientjes@google.com>
>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 2/3] mm, oom: introduce helper function to process threads during scan
  2012-06-26  3:22     ` Kamezawa Hiroyuki
@ 2012-06-26  6:05     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2012-06-26  6:05 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: David Rientjes, Andrew Morton, Michal Hocko, Johannes Weiner,
	Minchan Kim, linux-mm, cgroups

On Mon, Jun 25, 2012 at 11:22 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/06/26 10:47), David Rientjes wrote:
>>
>> This patch introduces a helper function to process each thread during the
>> iteration over the tasklist.  A new return type, enum oom_scan_t, is
>> defined to determine the future behavior of the iteration:
>>
>>  - OOM_SCAN_OK: continue scanning the thread and find its badness,
>>
>>  - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
>>    ineligible,
>>
>>  - OOM_SCAN_ABORT: abort the iteration and return, or
>>
>>  - OOM_SCAN_SELECT: always select this thread with the highest badness
>>    possible.
>>
>> There is no functional change with this patch.  This new helper function
>> will be used in the next patch in the memory controller.
>>
>> Signed-off-by: David Rientjes <rientjes@google.com>
>
>
> I like this.
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 1/3] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h
@ 2012-06-26  8:34   ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2012-06-26  8:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

On Mon 25-06-12 18:47:46, David Rientjes wrote:
> mem_cgroup_out_of_memory() is defined in mm/oom_kill.c, so declare it in
> linux/oom.h rather than linux/memcontrol.h.

Makes sense
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |    2 --
>  include/linux/oom.h        |    2 ++
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -72,8 +72,6 @@ extern void mem_cgroup_uncharge_end(void);
>  extern void mem_cgroup_uncharge_page(struct page *page);
>  extern void mem_cgroup_uncharge_cache_page(struct page *page);
>  
> -extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -				     int order);
>  bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>  				  struct mem_cgroup *memcg);
>  int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg);
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -49,6 +49,8 @@ extern unsigned long oom_badness(struct task_struct *p,
>  extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>  extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>  
> +extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +				     int order);
>  extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *mask, bool force_kill);
>  extern int register_oom_notifier(struct notifier_block *nb);

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 2/3] mm, oom: introduce helper function to process threads during scan
@ 2012-06-26  8:48     ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2012-06-26  8:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

On Mon 25-06-12 18:47:49, David Rientjes wrote:
> This patch introduces a helper function to process each thread during the
> iteration over the tasklist.  A new return type, enum oom_scan_t, is
> defined to determine the future behavior of the iteration:
> 
>  - OOM_SCAN_OK: continue scanning the thread and find its badness,
> 
>  - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
>    ineligible,
> 
>  - OOM_SCAN_ABORT: abort the iteration and return, or
> 
>  - OOM_SCAN_SELECT: always select this thread with the highest badness
>    possible.

I like it but could you add this as a doc for the enum?
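
A sketch of what that kernel-doc could look like (illustrative only, not
part of the posted patch; the wording just restates the changelog above):

	/**
	 * enum oom_scan_t - disposition of one thread during the oom
	 * tasklist iteration
	 * @OOM_SCAN_OK: continue scanning the thread and find its badness
	 * @OOM_SCAN_CONTINUE: the thread is ineligible, do not consider it
	 * @OOM_SCAN_ABORT: abort the iteration and return
	 * @OOM_SCAN_SELECT: always select the thread with the highest
	 * badness possible
	 */
	enum oom_scan_t {
		OOM_SCAN_OK,
		OOM_SCAN_CONTINUE,
		OOM_SCAN_ABORT,
		OOM_SCAN_SELECT,
	};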

> There is no functional change with this patch.  This new helper function
> will be used in the next patch in the memory controller.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/oom_kill.c |  111 +++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 65 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -288,6 +288,59 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>  }
>  #endif
>  
> +enum oom_scan_t {
> +	OOM_SCAN_OK,
> +	OOM_SCAN_CONTINUE,
> +	OOM_SCAN_ABORT,
> +	OOM_SCAN_SELECT,
> +};
> +
> +static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> +		struct mem_cgroup *memcg, unsigned long totalpages,
> +		const nodemask_t *nodemask, bool force_kill)
> +{
> +	if (task->exit_state)
> +		return OOM_SCAN_CONTINUE;
> +	if (oom_unkillable_task(task, memcg, nodemask))
> +		return OOM_SCAN_CONTINUE;
> +
> +	/*
> +	 * This task already has access to memory reserves and is being killed.
> +	 * Don't allow any other task to have access to the reserves.
> +	 */
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> +		if (unlikely(frozen(task)))
> +			__thaw_task(task);
> +		if (!force_kill)
> +			return OOM_SCAN_ABORT;
> +	}
> +	if (!task->mm)
> +		return OOM_SCAN_CONTINUE;
> +
> +	if (task->flags & PF_EXITING) {
> +		/*
> +		 * If task is current and is in the process of releasing memory,
> +		 * allow the "kill" to set TIF_MEMDIE, which will allow it to
> +		 * access memory reserves.  Otherwise, it may stall forever.
> +		 *
> +		 * The iteration isn't broken here, however, in case other
> +		 * threads are found to have already been oom killed.
> +		 */
> +		if (task == current)
> +			return OOM_SCAN_SELECT;
> +		else if (!force_kill) {
> +			/*
> +			 * If this task is not being ptraced on exit, then wait
> +			 * for it to finish before killing some other task
> +			 * unnecessarily.
> +			 */
> +			if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> +				return OOM_SCAN_ABORT;
> +		}
> +	}
> +	return OOM_SCAN_OK;
> +}
> +
>  /*
>   * Simple selection loop. We chose the process with the highest
>   * number of 'points'. We expect the caller will lock the tasklist.
> @@ -305,53 +358,19 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  	do_each_thread(g, p) {
>  		unsigned int points;
>  
> -		if (p->exit_state)
> -			continue;
> -		if (oom_unkillable_task(p, memcg, nodemask))
> -			continue;
> -
> -		/*
> -		 * This task already has access to memory reserves and is
> -		 * being killed. Don't allow any other task access to the
> -		 * memory reserve.
> -		 *
> -		 * Note: this may have a chance of deadlock if it gets
> -		 * blocked waiting for another task which itself is waiting
> -		 * for memory. Is there a better alternative?
> -		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> -			if (unlikely(frozen(p)))
> -				__thaw_task(p);
> -			if (!force_kill)
> -				return ERR_PTR(-1UL);
> -		}
> -		if (!p->mm)
> +		switch (oom_scan_process_thread(p, memcg, totalpages, nodemask,
> +						force_kill)) {
> +		case OOM_SCAN_SELECT:
> +			chosen = p;
> +			chosen_points = ULONG_MAX;
> +			/* fall through */
> +		case OOM_SCAN_CONTINUE:
>  			continue;
> -
> -		if (p->flags & PF_EXITING) {
> -			/*
> -			 * If p is the current task and is in the process of
> -			 * releasing memory, we allow the "kill" to set
> -			 * TIF_MEMDIE, which will allow it to gain access to
> -			 * memory reserves.  Otherwise, it may stall forever.
> -			 *
> -			 * The loop isn't broken here, however, in case other
> -			 * threads are found to have already been oom killed.
> -			 */
> -			if (p == current) {
> -				chosen = p;
> -				chosen_points = ULONG_MAX;
> -			} else if (!force_kill) {
> -				/*
> -				 * If this task is not being ptraced on exit,
> -				 * then wait for it to finish before killing
> -				 * some other task unnecessarily.
> -				 */
> -				if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
> -					return ERR_PTR(-1UL);
> -			}
> -		}
> -
> +		case OOM_SCAN_ABORT:
> +			return ERR_PTR(-1UL);
> +		case OOM_SCAN_OK:
> +			break;
> +		};
>  		points = oom_badness(p, memcg, nodemask, totalpages);
>  		if (points > chosen_points) {
>  			chosen = p;

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-26  9:58     ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2012-06-26  9:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

On Mon 25-06-12 18:47:53, David Rientjes wrote:
> The global oom killer is serialized by the zonelist being used in the
> page allocation.  Concurrent oom kills are thus a rare event and only
> occur in systems using mempolicies and with a large number of nodes.
> 
> Memory controller oom kills, however, can frequently be concurrent since
> there is no serialization once the oom killer is called for oom
> conditions in several different memcgs in parallel.
> 
> This creates a massive contention on tasklist_lock since the oom killer
> requires the readside for the tasklist iteration.  If several memcgs are
> calling the oom killer, this lock can be held for a substantial amount of
> time, especially if threads continue to enter it as other threads are
> exiting.
> 
> Since the exit path grabs the writeside of the lock with irqs disabled in
> a few different places, this can cause a soft lockup on cpus as a result
> of tasklist_lock starvation.
> 
> The kernel lacks unfair writelocks, and successful calls to the oom
> killer usually result in at least one thread entering the exit path, so
> an alternative solution is needed.
> 
> This patch introduces a separate oom handler for memcgs so that they do
> not require tasklist_lock for as much time.  Instead, it iterates only
> over the threads attached to the oom memcg and grabs a reference to the
> selected thread before calling oom_kill_process() to ensure it doesn't
> prematurely exit.
> 
> This still requires tasklist_lock for the tasklist dump, iterating
> children of the selected process, and killing all other threads on the
> system sharing the same memory as the selected victim.  So while this
> isn't a complete solution to tasklist_lock starvation, it significantly
> reduces the amount of time that it is held.

There is an issue with memcg ref. counting, but I like the approach in
general.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

After the things below are fixed
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |    9 ++-----
>  include/linux/oom.h        |   16 ++++++++++++
>  mm/memcontrol.c            |   62 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/oom_kill.c              |   48 +++++++++++-----------------------
>  4 files changed, 94 insertions(+), 41 deletions(-)
> 
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1470,6 +1470,66 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +				int order)
> +{
> +	struct mem_cgroup *iter;
> +	unsigned long chosen_points = 0;
> +	unsigned long totalpages;
> +	unsigned int points = 0;
> +	struct task_struct *chosen = NULL;
> +	struct task_struct *task;
> +
> +	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		struct cgroup *cgroup = iter->css.cgroup;
> +		struct cgroup_iter it;
> +
> +		cgroup_iter_start(cgroup, &it);

I guess this should protect from task move and exit, right?

> +		while ((task = cgroup_iter_next(cgroup, &it))) {
> +			switch (oom_scan_process_thread(task, totalpages, NULL,
> +							false)) {
> +			case OOM_SCAN_SELECT:
> +				if (chosen)
> +					put_task_struct(chosen);
> +				chosen = task;
> +				chosen_points = ULONG_MAX;
> +				get_task_struct(chosen);
> +				/* fall through */
> +			case OOM_SCAN_CONTINUE:
> +				continue;
> +			case OOM_SCAN_ABORT:
> +				cgroup_iter_end(cgroup, &it);
> +				if (chosen)
> +					put_task_struct(chosen);

You need mem_cgroup_iter_break here to have ref. counting correct.
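
For illustration, the abort path would then look roughly like this (a
sketch, not a tested patch; iter is the cursor of the
for_each_mem_cgroup_tree() above):

			case OOM_SCAN_ABORT:
				cgroup_iter_end(cgroup, &it);
				/* drop the css references the memcg
				 * iterator still holds */
				mem_cgroup_iter_break(memcg, iter);
				if (chosen)
					put_task_struct(chosen);
				return;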

> +				return;
> +			case OOM_SCAN_OK:
> +				break;
> +			};
> +			points = oom_badness(task, memcg, NULL, totalpages);
> +			if (points > chosen_points) {
> +				if (chosen)
> +					put_task_struct(chosen);
> +				chosen = task;
> +				chosen_points = points;
> +				get_task_struct(chosen);
> +			}
> +		}
> +		cgroup_iter_end(cgroup, &it);
> +		if (!memcg->use_hierarchy)
> +			break;

And this is not necessary, because for_each_mem_cgroup_tree is hierarchy
aware.
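
That is, the tail of the loop could simply drop the check (illustrative
diff only):

 		cgroup_iter_end(cgroup, &it);
-		if (!memcg->use_hierarchy)
-			break;
 	}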

[...]
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
  2012-06-26  5:32     ` Kamezawa Hiroyuki
@ 2012-06-26 20:38     ` David Rientjes
  2012-06-27  5:35         ` David Rientjes
  2012-06-28  8:52       ` Kamezawa Hiroyuki
  -1 siblings, 2 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-26 20:38 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, linux-mm, cgroups

On Tue, 26 Jun 2012, Kamezawa Hiroyuki wrote:

> > This still requires tasklist_lock for the tasklist dump, iterating
> > children of the selected process, and killing all other threads on the
> > system sharing the same memory as the selected victim.  So while this
> > isn't a complete solution to tasklist_lock starvation, it significantly
> > reduces the amount of time that it is held.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> This seems good. Thank you!
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

Thanks for the ack!

It's still not a perfect solution for the above reason.  We need 
tasklist_lock for oom_kill_process() for a few reasons:

 (1) if /proc/sys/vm/oom_dump_tasks is enabled, which is the default, 
     to iterate the tasklist

 (2) to iterate the selected process's children, and

 (3) to iterate the tasklist to kill all other processes sharing the 
     same memory.

I'm hoping we can avoid taking tasklist_lock entirely for memcg ooms to 
avoid the starvation problem at all.  We definitely still need to do (3) 
to avoid mm->mmap_sem deadlock if another thread sharing the same memory 
is holding the semaphore trying to allocate memory and waiting for current 
to exit, which needs the semaphore itself.  That can be done with 
rcu_read_lock(), however, and doesn't require tasklist_lock.

(1) can be done with rcu_read_lock() as well but I'm wondering if there 
would be a significant advantage doing this by a cgroup iterator as well.  
It may not be worth it just for the sanity of the code.

We can do (2) if we change to list_for_each_entry_rcu().

So I think I'll add another patch on top of this series to split up 
tasklist_lock handling even for the global oom killer and take references 
on task_struct as is done in this patchset, which should make it much 
easier to avoid taking tasklist_lock at all for memcg ooms.

Comments?
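
As a rough sketch of (3) under RCU (against the 3.5-era APIs shown in
this thread, not posted code; the tasklist is safe to walk under
rcu_read_lock()):

	rcu_read_lock();
	for_each_process(p) {
		/* kill other processes sharing the victim's memory */
		if (p->mm == mm && !same_thread_group(p, victim) &&
		    !(p->flags & PF_KTHREAD))
			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
	}
	rcu_read_unlock();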


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-27  5:35         ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-27  5:35 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, linux-mm, cgroups

On Tue, 26 Jun 2012, David Rientjes wrote:

> It's still not a perfect solution for the above reason.  We need 
> tasklist_lock for oom_kill_process() for a few reasons:
> 
>  (1) if /proc/sys/vm/oom_dump_tasks is enabled, which is the default, 
>      to iterate the tasklist
> 
>  (2) to iterate the selected process's children, and
> 
>  (3) to iterate the tasklist to kill all other processes sharing the 
>      same memory.
> 
> I'm hoping we can avoid taking tasklist_lock entirely for memcg ooms to 
> avoid the starvation problem at all.  We definitely still need to do (3) 
> to avoid mm->mmap_sem deadlock if another thread sharing the same memory 
> is holding the semaphore trying to allocate memory and waiting for current 
> to exit, which needs the semaphore itself.  That can be done with 
> rcu_read_lock(), however, and doesn't require tasklist_lock.
> 
> (1) can be done with rcu_read_lock() as well but I'm wondering if there 
> would be a significant advantage doing this by a cgroup iterator as well.  
> It may not be worth it just for the sanity of the code.
> 
> We can do (2) if we change to list_for_each_entry_rcu().
> 

It turns out that task->children is not an rcu-protected list so this 
doesn't work.  Both (1) and (3) can be accomplished with 
rcu_read_{lock,unlock}() that can nest inside the tasklist_lock for the 
global oom killer.  (We could even split the global oom killer tasklist 
locking and optimize it separately from this patchset.)

So we have a couple of options:

 - allow oom_kill_process() to do

	if (memcg)
		read_lock(&tasklist_lock);
	...
	if (memcg)
		read_unlock(&tasklist_lock);

   around the iteration over the victim's children.  This should solve the 
   issue since any other iteration over the entire tasklist would have 
   triggered the same starvation if it were that bad, or

 - suppress the iteration for memcg ooms and just kill the parent instead.

Comments?
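
Spelled out a little more, the first option would be something like this
(sketch only; the global path still holds tasklist_lock at this point,
so only the memcg path needs to take it):

	if (memcg)
		read_lock(&tasklist_lock);
	t = p;
	do {
		list_for_each_entry(child, &t->children, sibling) {
			/* compute oom_badness(child, ...) and possibly
			 * switch the victim to a child */
		}
	} while_each_thread(p, t);
	if (memcg)
		read_unlock(&tasklist_lock);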


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-28  1:43           ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-28  1:43 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On Tue, 26 Jun 2012, David Rientjes wrote:

> It turns out that task->children is not an rcu-protected list so this 
> doesn't work.  Both (1) and (3) can be accomplished with 
> rcu_read_{lock,unlock}() that can nest inside the tasklist_lock for the 
> global oom killer.  (We could even split the global oom killer tasklist 
> locking and optimize it separately from this patchset.)
> 
> So we have a couple of options:
> 
>  - allow oom_kill_process() to do
> 
> 	if (memcg)
> 		read_lock(&tasklist_lock);
> 	...
> 	if (memcg)
> 		read_unlock(&tasklist_lock);
> 
>    around the iteration over the victim's children.  This should solve the 
>    issue since any other iteration over the entire tasklist would have 
>    triggered the same starvation if it were that bad, or
> 
>  - suppress the iteration for memcg ooms and just kill the parent instead.
> 

Adding Oleg for comment as well.

I did the first option but I split tasklist_lock for global oom conditions 
as well.  The only place we actually need it is when iterating the 
victim's children since that list is not rcu-protected.  If this happens 
to be too painful for parallel memcg ooms then we can look to protecting 
it, but it shouldn't be a problem for global ooms because of the global 
oom killer's zonelist serialization.
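
(For reference, that serialization is the existing
try_set_zonelist_oom() handshake in the page allocator; roughly, as a
sketch of the 3.5-era flow:

	if (!try_set_zonelist_oom(zonelist, gfp_mask)) {
		/* another oom kill is in progress on this zonelist;
		 * back off and retry the allocation */
		schedule_timeout_uninterruptible(1);
		return NULL;
	}
	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
	clear_zonelist_oom(zonelist, gfp_mask);

so at most one global oom kill runs per zonelist at a time.)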

It's a tough patch to review, but the basics are that

 - oom_kill_process() is made to no longer need tasklist_lock; it's only
   taken for the iteration over children, and everything else, including
   dump_header(), is protected by rcu_read_lock() for kernels enabling
   /proc/sys/vm/oom_dump_tasks,

 - oom_kill_process() assumes that we have a reference to p, the victim,
   when it's called.  It can release this reference and grab a child's
   reference if necessary, dropping it before returning, and

 - select_bad_process() does not require tasklist_lock, it gets
   protected by rcu_read_lock() as well.

Comments?

 memcontrol.c |    2 --
 oom_kill.c   |   33 ++++++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 13 deletions(-)
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1522,10 +1522,8 @@ void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!chosen)
 		return;
 	points = chosen_points * 1000 / totalpages;
-	read_lock(&tasklist_lock);
 	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
 			 NULL, "Memory cgroup out of memory");
-	read_unlock(&tasklist_lock);
 	put_task_struct(chosen);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -336,7 +336,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 
 /*
  * Simple selection loop. We chose the process with the highest
- * number of 'points'. We expect the caller will lock the tasklist.
+ * number of 'points'.
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
@@ -348,6 +348,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	struct task_struct *chosen = NULL;
 	unsigned long chosen_points = 0;
 
+	rcu_read_lock();
 	do_each_thread(g, p) {
 		unsigned int points;
 
@@ -370,6 +371,9 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			chosen_points = points;
 		}
 	} while_each_thread(g, p);
+	if (chosen)
+		get_task_struct(chosen);
+	rcu_read_unlock();
 
 	*ppoints = chosen_points * 1000 / totalpages;
 	return chosen;
@@ -385,8 +389,6 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
  * are not shown.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
  * value, oom_score_adj value, and name.
- *
- * Call with tasklist_lock read-locked.
  */
 static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
 {
@@ -394,6 +396,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
 	struct task_struct *task;
 
 	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	rcu_read_lock();
 	for_each_process(p) {
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
@@ -415,6 +418,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
 			task->signal->oom_score_adj, task->comm);
 		task_unlock(task);
 	}
+	rcu_read_unlock();
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
@@ -454,6 +458,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	if (p->flags & PF_EXITING) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
+		put_task_struct(p);
 		return;
 	}
 
@@ -471,6 +476,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -483,15 +489,23 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			child_points = oom_badness(child, memcg, nodemask,
 								totalpages);
 			if (child_points > victim_points) {
+				put_task_struct(victim);
 				victim = child;
 				victim_points = child_points;
+				get_task_struct(victim);
 			}
 		}
 	} while_each_thread(p, t);
+	read_unlock(&tasklist_lock);
 
-	victim = find_lock_task_mm(victim);
-	if (!victim)
+	rcu_read_lock();
+	p = find_lock_task_mm(victim);
+	if (!p) {
+		rcu_read_unlock();
+		put_task_struct(victim);
 		return;
+	} else
+		victim = p;
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
@@ -522,9 +536,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
+	rcu_read_unlock();
 
 	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	put_task_struct(victim);
 }
 #undef K
 
@@ -545,9 +561,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 		if (constraint != CONSTRAINT_NONE)
 			return;
 	}
-	read_lock(&tasklist_lock);
 	dump_header(NULL, gfp_mask, order, NULL, nodemask);
-	read_unlock(&tasklist_lock);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
 }
@@ -720,10 +734,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
 	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
 
-	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
 	    current->mm) {
+		get_task_struct(current);
 		oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
 				 nodemask,
 				 "Out of memory (oom_kill_allocating_task)");
@@ -734,7 +748,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
 		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
-		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (PTR_ERR(p) != -1UL) {
@@ -743,8 +756,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		killed = 1;
 	}
 out:
-	read_unlock(&tasklist_lock);
-
 	/*
 	 * Give "p" a good chance of killing itself before we
 	 * retry to allocate memory unless "p" is current


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-28  1:43           ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-28  1:43 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, Oleg Nesterov, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, 26 Jun 2012, David Rientjes wrote:

> It turns out that task->children is not an rcu-protected list so this 
> doesn't work.  Both (1) and (3) can be accomplished with 
> rcu_read_{lock,unlock}() that can nest inside the tasklist_lock for the 
> global oom killer.  (We could even split the global oom killer tasklist 
> locking and optimize it seperately from this patchset.)
> 
> So we have a couple of options:
> 
>  - allow oom_kill_process() to do
> 
> 	if (memcg)
> 		read_lock(&tasklist_lock);
> 	...
> 	if (memcg)
> 		read_unlock(&tasklist_lock);
> 
>    around the iteration over the victim's children.  This should solve the 
>    issue since any other iteration over the entire tasklist would have 
>    triggered the same starvation if it were that bad, or
> 
>  - suppress the iteration for memcg ooms and just kill the parent instead.
> 

Adding Oleg for comment as well.

I did the first option but I split tasklist_lock for global oom conditions 
as well.  The only place we actually need it is when iterating the 
victim's children since that list is not rcu-protected.  If this happens 
to be too painful for parallel memcg ooms then we can look to protecting 
it, but it shouldn't be a problem for global ooms because of the global 
oom killer's zonelist serialization.

It's a tough patch to review, but the basics are that

 - oom_kill_process() is made to no longer need tasklist_lock; it's only
   taken for the iteration over children and everything else, including
   dump_header() is protected by rcu_read_lock() for kernels enabling
   /proc/sys/vm/oom_dump_tasks,

 - oom_kill_process() assumes that we have a reference to p, the victim,
   when it's called.  It may release this reference and grab a child's
   reference instead, and it drops whichever reference it holds before
   returning, and

 - select_bad_process() no longer requires tasklist_lock; it is
   protected by rcu_read_lock() as well.

Comments?
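
For orientation before the diff, here is a condensed sketch of the locking
pattern the three points above describe (a sketch only, not the patch
itself; child scoring, the task dump and error paths are elided):

	void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
			      unsigned int points, unsigned long totalpages,
			      struct mem_cgroup *memcg, nodemask_t *nodemask,
			      const char *message)
	{
		struct task_struct *victim = p;	/* caller passed us a reference */

		/* tasklist_lock now covers only the ->children walk */
		read_lock(&tasklist_lock);
		/* ... pick a better victim among the children, moving the
		 * task_struct reference along with the choice ... */
		read_unlock(&tasklist_lock);

		/* everything else runs under rcu_read_lock() */
		rcu_read_lock();
		/* ... find_lock_task_mm(), kill other users of the mm ... */
		rcu_read_unlock();

		put_task_struct(victim);	/* drop the reference */
	}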

 memcontrol.c |    2 --
 oom_kill.c   |   33 ++++++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 13 deletions(-)
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1522,10 +1522,8 @@ void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!chosen)
 		return;
 	points = chosen_points * 1000 / totalpages;
-	read_lock(&tasklist_lock);
 	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
 			 NULL, "Memory cgroup out of memory");
-	read_unlock(&tasklist_lock);
 	put_task_struct(chosen);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -336,7 +336,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 
 /*
  * Simple selection loop. We chose the process with the highest
- * number of 'points'. We expect the caller will lock the tasklist.
+ * number of 'points'.
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
@@ -348,6 +348,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	struct task_struct *chosen = NULL;
 	unsigned long chosen_points = 0;
 
+	rcu_read_lock();
 	do_each_thread(g, p) {
 		unsigned int points;
 
@@ -370,6 +371,9 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			chosen_points = points;
 		}
 	} while_each_thread(g, p);
+	if (chosen)
+		get_task_struct(chosen);
+	rcu_read_unlock();
 
 	*ppoints = chosen_points * 1000 / totalpages;
 	return chosen;
@@ -385,8 +389,6 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
  * are not shown.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
  * value, oom_score_adj value, and name.
- *
- * Call with tasklist_lock read-locked.
  */
 static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
 {
@@ -394,6 +396,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
 	struct task_struct *task;
 
 	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	rcu_read_lock();
 	for_each_process(p) {
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
@@ -415,6 +418,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
 			task->signal->oom_score_adj, task->comm);
 		task_unlock(task);
 	}
+	rcu_read_unlock();
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
@@ -454,6 +458,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	if (p->flags & PF_EXITING) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
+		put_task_struct(p);
 		return;
 	}
 
@@ -471,6 +476,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -483,15 +489,23 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			child_points = oom_badness(child, memcg, nodemask,
 								totalpages);
 			if (child_points > victim_points) {
+				put_task_struct(victim);
 				victim = child;
 				victim_points = child_points;
+				get_task_struct(victim);
 			}
 		}
 	} while_each_thread(p, t);
+	read_unlock(&tasklist_lock);
 
-	victim = find_lock_task_mm(victim);
-	if (!victim)
+	rcu_read_lock();
+	p = find_lock_task_mm(victim);
+	if (!p) {
+		rcu_read_unlock();
+		put_task_struct(victim);
 		return;
+	} else
+		victim = p;
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
@@ -522,9 +536,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
+	rcu_read_unlock();
 
 	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	put_task_struct(victim);
 }
 #undef K
 
@@ -545,9 +561,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 		if (constraint != CONSTRAINT_NONE)
 			return;
 	}
-	read_lock(&tasklist_lock);
 	dump_header(NULL, gfp_mask, order, NULL, nodemask);
-	read_unlock(&tasklist_lock);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
 }
@@ -720,10 +734,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
 	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
 
-	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
 	    current->mm) {
+		get_task_struct(current);
 		oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
 				 nodemask,
 				 "Out of memory (oom_kill_allocating_task)");
@@ -734,7 +748,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
 		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
-		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (PTR_ERR(p) != -1UL) {
@@ -743,8 +756,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		killed = 1;
 	}
 out:
-	read_unlock(&tasklist_lock);
-
 	/*
 	 * Give "p" a good chance of killing itself before we
 	 * retry to allocate memory unless "p" is current

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
  2012-06-26 20:38     ` David Rientjes
  2012-06-27  5:35         ` David Rientjes
@ 2012-06-28  8:52       ` Kamezawa Hiroyuki
  1 sibling, 0 replies; 66+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-28  8:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, linux-mm, cgroups

(2012/06/27 5:38), David Rientjes wrote:
> On Tue, 26 Jun 2012, Kamezawa Hiroyuki wrote:
>
>>> This still requires tasklist_lock for the tasklist dump, iterating
>>> children of the selected process, and killing all other threads on the
>>> system sharing the same memory as the selected victim.  So while this
>>> isn't a complete solution to tasklist_lock starvation, it significantly
>>> reduces the amount of time that it is held.
>>>
>>> Signed-off-by: David Rientjes <rientjes@google.com>
>>
>> This seems good. Thank you!
>>
>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>
> Thanks for the ack!
>
> It's still not a perfect solution for the above reason.  We need
> tasklist_lock for oom_kill_process() for a few reasons:
>
>   (1) if /proc/sys/vm/oom_dump_tasks is enabled, which is the default,
>       to iterate the tasklist
>
>   (2) to iterate the selected process's children, and
>
>   (3) to iterate the tasklist to kill all other processes sharing the
>       same memory.
>
> I'm hoping we can avoid taking tasklist_lock entirely for memcg ooms to
> avoid the starvation problem at all.  We definitely still need to do (3)
> to avoid mm->mmap_sem deadlock if another thread sharing the same memory
> is holding the semaphore trying to allocate memory and waiting for current
> to exit, which needs the semaphore itself.  That can be done with
> rcu_read_lock(), however, and doesn't require tasklist_lock.
>
> (1) can be done with rcu_read_lock() as well but I'm wondering if there
> would be a significant advantage doing this by a cgroup iterator as well.
> It may not be worth it just for the sanity of the code.
>
> We can do (2) if we change to list_for_each_entry_rcu().
>
> So I think I'll add another patch on top of this series to split up
> tasklist_lock handling even for the global oom killer and take references
> on task_struct like it is done in this patchset which should make avoiding
> taking tasklist_lock at all for memcg ooms much easier.
>
> Comments?
>

sounds reasonable to me.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-28  8:55           ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 66+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-28  8:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, linux-mm, cgroups

(2012/06/27 14:35), David Rientjes wrote:
> On Tue, 26 Jun 2012, David Rientjes wrote:
>
>> It's still not a perfect solution for the above reason.  We need
>> tasklist_lock for oom_kill_process() for a few reasons:
>>
>>   (1) if /proc/sys/vm/oom_dump_tasks is enabled, which is the default,
>>       to iterate the tasklist
>>
>>   (2) to iterate the selected process's children, and
>>
>>   (3) to iterate the tasklist to kill all other processes sharing the
>>       same memory.
>>
>> I'm hoping we can avoid taking tasklist_lock entirely for memcg ooms to
>> avoid the starvation problem at all.  We definitely still need to do (3)
>> to avoid mm->mmap_sem deadlock if another thread sharing the same memory
>> is holding the semaphore trying to allocate memory and waiting for current
>> to exit, which needs the semaphore itself.  That can be done with
>> rcu_read_lock(), however, and doesn't require tasklist_lock.
>>
>> (1) can be done with rcu_read_lock() as well but I'm wondering if there
>> would be a significant advantage doing this by a cgroup iterator as well.
>> It may not be worth it just for the sanity of the code.
>>
>> We can do (2) if we change to list_for_each_entry_rcu().
>>
>
> It turns out that task->children is not an rcu-protected list so this
> doesn't work.

Can't we use sighand->lock to iterate children?


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
  2012-06-28  1:43           ` David Rientjes
  (?)
@ 2012-06-28 17:16           ` Oleg Nesterov
  2012-06-29 20:37             ` David Rientjes
  -1 siblings, 1 reply; 66+ messages in thread
From: Oleg Nesterov @ 2012-06-28 17:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Kamezawa Hiroyuki, Andrew Morton, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

On 06/27, David Rientjes wrote:
>
> On Tue, 26 Jun 2012, David Rientjes wrote:
>
> > It turns out that task->children is not an rcu-protected list so this
> > doesn't work.

Yes. And just in case, we can't rcuify ->children because of re-parenting.

> It's a tough patch to review, but the basics are that
>
>  - oom_kill_process() is made to no longer need tasklist_lock; it's only
>    taken for the iteration over children and everything else, including
>    dump_header() is protected by rcu_read_lock() for kernels enabling
>    /proc/sys/vm/oom_dump_tasks,
>
>  - oom_kill_process() assumes that we have a reference to p, the victim,
>    when it's called.  It can release this reference and grab a child's
>    reference if necessary and drops it before returning, and
>
>  - select_bad_process() does not require tasklist_lock, it gets
>    protected by rcu_read_lock() as well.

Looks correct at first glance... (ignoring the fact that we still need
fixes for the while_each_thread/rcu interaction, but that is off-topic
and should be fixed anyway).

> @@ -348,6 +348,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  	struct task_struct *chosen = NULL;
>  	unsigned long chosen_points = 0;
>
> +	rcu_read_lock();
>  	do_each_thread(g, p) {
>  		unsigned int points;
>
> @@ -370,6 +371,9 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  			chosen_points = points;
>  		}
>  	} while_each_thread(g, p);
> +	if (chosen)
> +		get_task_struct(chosen);

OK, so the caller should do put_task_struct().

But, unless I misread the patch,

> @@ -454,6 +458,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> ...
> +	rcu_read_lock();
> +	p = find_lock_task_mm(victim);
> +	if (!p) {
> +		rcu_read_unlock();
> +		put_task_struct(victim);
>  		return;
> +	} else
> +		victim = p;

And, before return,

> +	put_task_struct(victim);

Doesn't look right if victim != p.

Oleg.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-29 20:30             ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-29 20:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On Thu, 28 Jun 2012, Kamezawa Hiroyuki wrote:

> > It turns out that task->children is not an rcu-protected list so this
> > doesn't work.
> 
> Can't we use sighand->lock to iterate children?
> 

I don't think so, this list is protected by tasklist_lock.  Oleg?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
  2012-06-28 17:16           ` Oleg Nesterov
@ 2012-06-29 20:37             ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-29 20:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kamezawa Hiroyuki, Andrew Morton, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

On Thu, 28 Jun 2012, Oleg Nesterov wrote:

> > @@ -348,6 +348,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> >  	struct task_struct *chosen = NULL;
> >  	unsigned long chosen_points = 0;
> >
> > +	rcu_read_lock();
> >  	do_each_thread(g, p) {
> >  		unsigned int points;
> >
> > @@ -370,6 +371,9 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> >  			chosen_points = points;
> >  		}
> >  	} while_each_thread(g, p);
> > +	if (chosen)
> > +		get_task_struct(chosen);
> 
> OK, so the caller should do put_task_struct().
> 

oom_kill_process() will now do the put_task_struct() since we need a 
reference before killing the task, so callers of oom_kill_process() are 
responsible for grabbing one before doing rcu_read_unlock().
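
In other words, the intended flow at a call site such as out_of_memory()
is (a sketch; arguments as in the patch):

	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
	/* select_bad_process() returned with a reference to p held ... */
	if (p && PTR_ERR(p) != -1UL)
		oom_kill_process(p, gfp_mask, order, points, totalpages,
				 NULL, nodemask, "Out of memory");
	/* ... and oom_kill_process() drops it on every return path */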

> But, unless I misread the patch,
> 
> > @@ -454,6 +458,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > ...
> > +	rcu_read_lock();
> > +	p = find_lock_task_mm(victim);
> > +	if (!p) {
> > +		rcu_read_unlock();
> > +		put_task_struct(victim);

So if the victim has no threads that have an mm, then we have raced with 
select_bad_process() and we silently return.

> >  		return;
> > +	} else
> > +		victim = p;
> 
> And, before return,
> 
> > +	put_task_struct(victim);
> 
> Doesn't look right if victim != p.
> 

Ah, good catch, we need to do

	if (!p) {
		rcu_read_unlock();
		put_task_struct(victim);
		return;
	} else {
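		/* move the reference from the old victim to the mm-holding thread */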
		put_task_struct(victim);
		victim = p;
		get_task_struct(victim);
		rcu_read_unlock();
	}

Thanks.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 1/5] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h
  2012-06-26  1:47 ` David Rientjes
                   ` (4 preceding siblings ...)
  (?)
@ 2012-06-29 21:06 ` David Rientjes
  2012-06-29 21:06     ` David Rientjes
                     ` (4 more replies)
  -1 siblings, 5 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-29 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

mem_cgroup_out_of_memory() is defined in mm/oom_kill.c, so declare it in
linux/oom.h rather than linux/memcontrol.h.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/memcontrol.h |    2 --
 include/linux/oom.h        |    2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -72,8 +72,6 @@ extern void mem_cgroup_uncharge_end(void);
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
 
-extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				     int order);
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg);
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,6 +49,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
+extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+				     int order);
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 2/5] mm, oom: introduce helper function to process threads during scan
@ 2012-06-29 21:06     ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-29 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

This patch introduces a helper function to process each thread during the
iteration over the tasklist.  A new return type, enum oom_scan_t, is
defined to determine the future behavior of the iteration:

 - OOM_SCAN_OK: continue scanning the thread and find its badness,

 - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
   ineligible,

 - OOM_SCAN_ABORT: abort the iteration and return, or

 - OOM_SCAN_SELECT: always select this thread with the highest badness
   possible.

There is no functional change with this patch.  This new helper function
will be used in the next patch in the memory controller.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |  111 +++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 65 insertions(+), 46 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,6 +288,59 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
+enum oom_scan_t {
+	OOM_SCAN_OK,		/* scan thread and find its badness */
+	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
+	OOM_SCAN_ABORT,		/* abort the iteration and return */
+	OOM_SCAN_SELECT,	/* always select this thread first */
+};
+
+static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
+		struct mem_cgroup *memcg, unsigned long totalpages,
+		const nodemask_t *nodemask, bool force_kill)
+{
+	if (task->exit_state)
+		return OOM_SCAN_CONTINUE;
+	if (oom_unkillable_task(task, memcg, nodemask))
+		return OOM_SCAN_CONTINUE;
+
+	/*
+	 * This task already has access to memory reserves and is being killed.
+	 * Don't allow any other task to have access to the reserves.
+	 */
+	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+		if (unlikely(frozen(task)))
+			__thaw_task(task);
+		if (!force_kill)
+			return OOM_SCAN_ABORT;
+	}
+	if (!task->mm)
+		return OOM_SCAN_CONTINUE;
+
+	if (task->flags & PF_EXITING) {
+		/*
+		 * If task is current and is in the process of releasing memory,
+		 * allow the "kill" to set TIF_MEMDIE, which will allow it to
+		 * access memory reserves.  Otherwise, it may stall forever.
+		 *
+		 * The iteration isn't broken here, however, in case other
+		 * threads are found to have already been oom killed.
+		 */
+		if (task == current)
+			return OOM_SCAN_SELECT;
+		else if (!force_kill) {
+			/*
+			 * If this task is not being ptraced on exit, then wait
+			 * for it to finish before killing some other task
+			 * unnecessarily.
+			 */
+			if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
+				return OOM_SCAN_ABORT;
+		}
+	}
+	return OOM_SCAN_OK;
+}
+
 /*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
@@ -305,53 +358,19 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	do_each_thread(g, p) {
 		unsigned int points;
 
-		if (p->exit_state)
-			continue;
-		if (oom_unkillable_task(p, memcg, nodemask))
-			continue;
-
-		/*
-		 * This task already has access to memory reserves and is
-		 * being killed. Don't allow any other task access to the
-		 * memory reserve.
-		 *
-		 * Note: this may have a chance of deadlock if it gets
-		 * blocked waiting for another task which itself is waiting
-		 * for memory. Is there a better alternative?
-		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
-			if (unlikely(frozen(p)))
-				__thaw_task(p);
-			if (!force_kill)
-				return ERR_PTR(-1UL);
-		}
-		if (!p->mm)
+		switch (oom_scan_process_thread(p, memcg, totalpages, nodemask,
+						force_kill)) {
+		case OOM_SCAN_SELECT:
+			chosen = p;
+			chosen_points = ULONG_MAX;
+			/* fall through */
+		case OOM_SCAN_CONTINUE:
 			continue;
-
-		if (p->flags & PF_EXITING) {
-			/*
-			 * If p is the current task and is in the process of
-			 * releasing memory, we allow the "kill" to set
-			 * TIF_MEMDIE, which will allow it to gain access to
-			 * memory reserves.  Otherwise, it may stall forever.
-			 *
-			 * The loop isn't broken here, however, in case other
-			 * threads are found to have already been oom killed.
-			 */
-			if (p == current) {
-				chosen = p;
-				chosen_points = ULONG_MAX;
-			} else if (!force_kill) {
-				/*
-				 * If this task is not being ptraced on exit,
-				 * then wait for it to finish before killing
-				 * some other task unnecessarily.
-				 */
-				if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
-					return ERR_PTR(-1UL);
-			}
-		}
-
+		case OOM_SCAN_ABORT:
+			return ERR_PTR(-1UL);
+		case OOM_SCAN_OK:
+			break;
+		};
 		points = oom_badness(p, memcg, nodemask, totalpages);
 		if (points > chosen_points) {
 			chosen = p;


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 3/5] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-06-29 21:06     ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-29 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

The global oom killer is serialized by the zonelist being used in the
page allocation.  Concurrent oom kills are thus a rare event and only
occur in systems using mempolicies and with a large number of nodes.

Memory controller oom kills, however, can frequently be concurrent since
there is no serialization once the oom killer is called for oom
conditions in several different memcgs in parallel.

This creates massive contention on tasklist_lock since the oom killer
requires the readside for the tasklist iteration.  If several memcgs are
calling the oom killer, this lock can be held for a substantial amount of
time, especially if threads continue to enter it as other threads are
exiting.

Since the exit path grabs the writeside of the lock with irqs disabled in
a few different places, this can cause a soft lockup on cpus as a result
of tasklist_lock starvation.

The kernel lacks unfair writelocks, and successful calls to the oom
killer usually result in at least one thread entering the exit path, so
an alternative solution is needed.
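
Schematically, the contention looks like this (a sketch; release_task()
is one of the exit-path writers):

	/* oom killer: readside held across the whole tasklist scan */
	read_lock(&tasklist_lock);
	do_each_thread(g, p) {
		/* ... oom_badness() scoring ... */
	} while_each_thread(g, p);
	read_unlock(&tasklist_lock);

	/* exit path: spins with irqs disabled until all readers drain,
	 * while new readers keep streaming in */
	write_lock_irq(&tasklist_lock);
	/* ... e.g. release_task() unhashing the exiting thread ... */
	write_unlock_irq(&tasklist_lock);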

This patch introduces a separate oom handler for memcgs so that they do
not require tasklist_lock for as much time.  Instead, it iterates only
over the threads attached to the oom memcg and grabs a reference to the
selected thread before calling oom_kill_process() to ensure it doesn't
prematurely exit.

This still requires tasklist_lock for the tasklist dump, iterating
children of the selected process, and killing all other threads on the
system sharing the same memory as the selected victim.  So while this
isn't a complete solution to tasklist_lock starvation, it significantly
reduces the amount of time that it is held.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/memcontrol.h |    9 +-----
 include/linux/oom.h        |   16 +++++++++++
 mm/memcontrol.c            |   61 +++++++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c              |   48 +++++++++++-----------------------
 4 files changed, 93 insertions(+), 41 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -180,7 +180,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg);
+extern void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+				       int order);
 
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -364,12 +365,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
-static inline
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
-{
-	return 0;
-}
-
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
 {
 }
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -40,17 +40,33 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+enum oom_scan_t {
+	OOM_SCAN_OK,		/* scan thread and find its badness */
+	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
+	OOM_SCAN_ABORT,		/* abort the iteration and return */
+	OOM_SCAN_SELECT,	/* always select this thread first */
+};
+
 extern void compare_swap_oom_score_adj(int old_val, int new_val);
 extern int test_set_oom_score_adj(int new_val);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
+extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+			     unsigned int points, unsigned long totalpages,
+			     struct mem_cgroup *memcg, nodemask_t *nodemask,
+			     const char *message);
+
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
+extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
+		unsigned long totalpages, const nodemask_t *nodemask,
+		bool force_kill);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order);
+
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1453,7 +1453,7 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
 /*
  * Return the memory (and swap, if configured) limit for a memcg.
  */
-u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 {
 	u64 limit;
 	u64 memsw;
@@ -1469,6 +1469,65 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return min(limit, memsw);
 }
 
+void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+				int order)
+{
+	struct mem_cgroup *iter;
+	unsigned long chosen_points = 0;
+	unsigned long totalpages;
+	unsigned int points = 0;
+	struct task_struct *chosen = NULL;
+
+	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+	for_each_mem_cgroup_tree(iter, memcg) {
+		struct cgroup *cgroup = iter->css.cgroup;
+		struct cgroup_iter it;
+		struct task_struct *task;
+
+		cgroup_iter_start(cgroup, &it);
+		while ((task = cgroup_iter_next(cgroup, &it))) {
+			switch (oom_scan_process_thread(task, totalpages, NULL,
+							false)) {
+			case OOM_SCAN_SELECT:
+				if (chosen)
+					put_task_struct(chosen);
+				chosen = task;
+				chosen_points = ULONG_MAX;
+				get_task_struct(chosen);
+				/* fall through */
+			case OOM_SCAN_CONTINUE:
+				continue;
+			case OOM_SCAN_ABORT:
+				cgroup_iter_end(cgroup, &it);
+				mem_cgroup_iter_break(memcg, iter);
+				if (chosen)
+					put_task_struct(chosen);
+				return;
+			case OOM_SCAN_OK:
+				break;
+			};
+			points = oom_badness(task, memcg, NULL, totalpages);
+			if (points > chosen_points) {
+				if (chosen)
+					put_task_struct(chosen);
+				chosen = task;
+				chosen_points = points;
+				get_task_struct(chosen);
+			}
+		}
+		cgroup_iter_end(cgroup, &it);
+	}
+
+	if (!chosen)
+		return;
+	points = chosen_points * 1000 / totalpages;
+	read_lock(&tasklist_lock);
+	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
+			 NULL, "Memory cgroup out of memory");
+	read_unlock(&tasklist_lock);
+	put_task_struct(chosen);
+}
+
 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 					gfp_t gfp_mask,
 					unsigned long flags)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,20 +288,13 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
-enum oom_scan_t {
-	OOM_SCAN_OK,		/* scan thread and find its badness */
-	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
-	OOM_SCAN_ABORT,		/* abort the iteration and return */
-	OOM_SCAN_SELECT,	/* always select this thread first */
-};
-
-static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
-		struct mem_cgroup *memcg, unsigned long totalpages,
-		const nodemask_t *nodemask, bool force_kill)
+enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
+		unsigned long totalpages, const nodemask_t *nodemask,
+		bool force_kill)
 {
 	if (task->exit_state)
 		return OOM_SCAN_CONTINUE;
-	if (oom_unkillable_task(task, memcg, nodemask))
+	if (oom_unkillable_task(task, NULL, nodemask))
 		return OOM_SCAN_CONTINUE;
 
 	/*
@@ -348,8 +341,8 @@ static enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned int *ppoints,
-		unsigned long totalpages, struct mem_cgroup *memcg,
-		const nodemask_t *nodemask, bool force_kill)
+		unsigned long totalpages, const nodemask_t *nodemask,
+		bool force_kill)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -358,7 +351,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	do_each_thread(g, p) {
 		unsigned int points;
 
-		switch (oom_scan_process_thread(p, memcg, totalpages, nodemask,
+		switch (oom_scan_process_thread(p, totalpages, nodemask,
 						force_kill)) {
 		case OOM_SCAN_SELECT:
 			chosen = p;
@@ -371,7 +364,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		case OOM_SCAN_OK:
 			break;
 		};
-		points = oom_badness(p, memcg, nodemask, totalpages);
+		points = oom_badness(p, NULL, nodemask, totalpages);
 		if (points > chosen_points) {
 			chosen = p;
 			chosen_points = points;
@@ -442,10 +435,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
-static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			     unsigned int points, unsigned long totalpages,
-			     struct mem_cgroup *memcg, nodemask_t *nodemask,
-			     const char *message)
+void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+		      unsigned int points, unsigned long totalpages,
+		      struct mem_cgroup *memcg, nodemask_t *nodemask,
+		      const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -563,10 +556,6 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			      int order)
 {
-	unsigned long limit;
-	unsigned int points = 0;
-	struct task_struct *p;
-
 	/*
 	 * If current has a pending SIGKILL, then automatically select it.  The
 	 * goal is to allow it to allocate so that it may quickly exit and free
@@ -578,13 +567,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	}
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
-	limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
-	read_lock(&tasklist_lock);
-	p = select_bad_process(&points, limit, memcg, NULL, false);
-	if (p && PTR_ERR(p) != -1UL)
-		oom_kill_process(p, gfp_mask, order, points, limit, memcg, NULL,
-				 "Memory cgroup out of memory");
-	read_unlock(&tasklist_lock);
+	__mem_cgroup_out_of_memory(memcg, gfp_mask, order);
 }
 #endif
 
@@ -709,7 +692,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	struct task_struct *p;
 	unsigned long totalpages;
 	unsigned long freed = 0;
-	unsigned int points;
+	unsigned int uninitialized_var(points);
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;
 
@@ -747,8 +730,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		goto out;
 	}
 
-	p = select_bad_process(&points, totalpages, NULL, mpol_mask,
-			       force_kill);
+	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
 		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 4/5] mm, oom: reduce dependency on tasklist_lock
  2012-06-29 21:06 ` [patch 1/5] " David Rientjes
  2012-06-29 21:06     ` David Rientjes
  2012-06-29 21:06     ` David Rientjes
@ 2012-06-29 21:06   ` David Rientjes
  2012-07-03 18:17       ` Oleg Nesterov
  2012-07-13 14:32       ` Michal Hocko
  2012-06-29 21:07     ` David Rientjes
  2012-07-10 21:05     ` David Rientjes
  4 siblings, 2 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-29 21:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
try to reduce the amount of time the readside is held for oom kills.
This makes the interface with the memcg oom handler more consistent since
it now never needs to take tasklist_lock unnecessarily.

The only time the oom killer now takes tasklist_lock is when iterating
the children of the selected task, everything else is protected by
rcu_read_lock().

This requires that a reference to the selected process, p, is grabbed
before calling oom_kill_process().  It may release it and grab a
reference on another one of p's threads if !p->mm, but it also guarantees
that it will release the reference before returning.
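
The caller side then follows this pattern (an illustrative sketch only,
not part of the diff below; the error handling is condensed):

	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
	/* select_bad_process() returns p with a reference held */
	if (p && PTR_ERR(p) != -1UL)
		oom_kill_process(p, gfp_mask, order, points, totalpages,
				 NULL, nodemask, "Out of memory");
	/* no put_task_struct(p) here: oom_kill_process() dropped it */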

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/memcontrol.c |    2 --
 mm/oom_kill.c   |   40 +++++++++++++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1521,10 +1521,8 @@ void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!chosen)
 		return;
 	points = chosen_points * 1000 / totalpages;
-	read_lock(&tasklist_lock);
 	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
 			 NULL, "Memory cgroup out of memory");
-	read_unlock(&tasklist_lock);
 	put_task_struct(chosen);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -336,7 +336,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 
 /*
  * Simple selection loop. We chose the process with the highest
- * number of 'points'. We expect the caller will lock the tasklist.
+ * number of 'points'.
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
@@ -348,6 +348,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	struct task_struct *chosen = NULL;
 	unsigned long chosen_points = 0;
 
+	rcu_read_lock();
 	do_each_thread(g, p) {
 		unsigned int points;
 
@@ -370,6 +371,9 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			chosen_points = points;
 		}
 	} while_each_thread(g, p);
+	if (chosen)
+		get_task_struct(chosen);
+	rcu_read_unlock();
 
 	*ppoints = chosen_points * 1000 / totalpages;
 	return chosen;
@@ -385,8 +389,6 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
  * are not shown.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
  * value, oom_score_adj value, and name.
- *
- * Call with tasklist_lock read-locked.
  */
 static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
 {
@@ -394,6 +396,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
 	struct task_struct *task;
 
 	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
+	rcu_read_lock();
 	for_each_process(p) {
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
@@ -415,6 +418,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
 			task->signal->oom_score_adj, task->comm);
 		task_unlock(task);
 	}
+	rcu_read_unlock();
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
@@ -435,6 +439,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
+/*
+ * Must be called while holding a reference to p, which will be released upon
+ * returning.
+ */
 void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		      unsigned int points, unsigned long totalpages,
 		      struct mem_cgroup *memcg, nodemask_t *nodemask,
@@ -454,6 +462,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	if (p->flags & PF_EXITING) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
+		put_task_struct(p);
 		return;
 	}
 
@@ -471,6 +480,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	read_lock(&tasklist_lock);
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -483,15 +493,26 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			child_points = oom_badness(child, memcg, nodemask,
 								totalpages);
 			if (child_points > victim_points) {
+				put_task_struct(victim);
 				victim = child;
 				victim_points = child_points;
+				get_task_struct(victim);
 			}
 		}
 	} while_each_thread(p, t);
+	read_unlock(&tasklist_lock);
 
-	victim = find_lock_task_mm(victim);
-	if (!victim)
+	rcu_read_lock();
+	p = find_lock_task_mm(victim);
+	if (!p) {
+		rcu_read_unlock();
+		put_task_struct(victim);
 		return;
+	} else if (victim != p) {
+		get_task_struct(p);
+		put_task_struct(victim);
+		victim = p;
+	}
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
@@ -522,9 +543,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
+	rcu_read_unlock();
 
 	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	put_task_struct(victim);
 }
 #undef K
 
@@ -545,9 +568,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 		if (constraint != CONSTRAINT_NONE)
 			return;
 	}
-	read_lock(&tasklist_lock);
 	dump_header(NULL, gfp_mask, order, NULL, nodemask);
-	read_unlock(&tasklist_lock);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
 }
@@ -720,10 +741,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
 	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
 
-	read_lock(&tasklist_lock);
 	if (sysctl_oom_kill_allocating_task &&
 	    !oom_unkillable_task(current, NULL, nodemask) &&
 	    current->mm) {
+		get_task_struct(current);
 		oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
 				 nodemask,
 				 "Out of memory (oom_kill_allocating_task)");
@@ -734,7 +755,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
 		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
-		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (PTR_ERR(p) != -1UL) {
@@ -743,8 +763,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		killed = 1;
 	}
 out:
-	read_unlock(&tasklist_lock);
-
 	/*
 	 * Give the killed threads a good chance of exiting before trying to
 	 * allocate memory again.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [patch 5/5] mm, memcg: move all oom handling to memcontrol.c
@ 2012-06-29 21:07     ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-06-29 21:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

By globally defining check_panic_on_oom(), the memcg oom handler can be
moved entirely to mm/memcontrol.c.  This removes the ugly #ifdef in the
oom killer and cleans up the code.

Signed-off-by: David Rientjes <rientjes@google.com> 
---
 include/linux/memcontrol.h |    2 --
 include/linux/oom.h        |    3 +++
 mm/memcontrol.c            |   15 +++++++++++++--
 mm/oom_kill.c              |   23 ++---------------------
 4 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -180,8 +180,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
-extern void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				       int order);
 
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -61,6 +61,9 @@ extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
+extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
+			       int order, const nodemask_t *nodemask);
+
 extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1469,8 +1469,8 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return min(limit, memsw);
 }
 
-void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				int order)
+void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+			      int order)
 {
 	struct mem_cgroup *iter;
 	unsigned long chosen_points = 0;
@@ -1478,6 +1478,17 @@ void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
+	/*
+	 * If current has a pending SIGKILL, then automatically select it.  The
+	 * goal is to allow it to allocate so that it may quickly exit and free
+	 * its memory.
+	 */
+	if (fatal_signal_pending(current)) {
+		set_thread_flag(TIF_MEMDIE);
+		return;
+	}
+
+	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct cgroup *cgroup = iter->css.cgroup;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -554,8 +554,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
-				int order, const nodemask_t *nodemask)
+void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
+			int order, const nodemask_t *nodemask)
 {
 	if (likely(!sysctl_panic_on_oom))
 		return;
@@ -573,25 +573,6 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
 }
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-			      int order)
-{
-	/*
-	 * If current has a pending SIGKILL, then automatically select it.  The
-	 * goal is to allow it to allocate so that it may quickly exit and free
-	 * its memory.
-	 */
-	if (fatal_signal_pending(current)) {
-		set_thread_flag(TIF_MEMDIE);
-		return;
-	}
-
-	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
-	__mem_cgroup_out_of_memory(memcg, gfp_mask, order);
-}
-#endif
-
 static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
 
 int register_oom_notifier(struct notifier_block *nb)


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads
  2012-06-29 20:30             ` David Rientjes
  (?)
@ 2012-07-03 17:56             ` Oleg Nesterov
  -1 siblings, 0 replies; 66+ messages in thread
From: Oleg Nesterov @ 2012-07-03 17:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Kamezawa Hiroyuki, Andrew Morton, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

Sorry for the delay,

On 06/29, David Rientjes wrote:
>
> On Thu, 28 Jun 2012, Kamezawa Hiroyuki wrote:
>
> > > It turns out that task->children is not an rcu-protected list so this
> > > doesn't work.
> >
> > Can't we use sighand->lock to iterate children ?
> >
>
> I don't think so, this list is protected by tasklist_lock.  Oleg?

Yes, you are right, ->siglock can't help.

Oleg.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 4/5] mm, oom: reduce dependency on tasklist_lock
@ 2012-07-03 18:17       ` Oleg Nesterov
  0 siblings, 0 replies; 66+ messages in thread
From: Oleg Nesterov @ 2012-07-03 18:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

On 06/29, David Rientjes wrote:
>
> +/*
> + * Must be called while holding a reference to p, which will be released upon
> + * returning.
> + */

I am not really sure this is the most clean approach, see below...

>  void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		      unsigned int points, unsigned long totalpages,
>  		      struct mem_cgroup *memcg, nodemask_t *nodemask,
> @@ -454,6 +462,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 */
>  	if (p->flags & PF_EXITING) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
> +		put_task_struct(p);
>  		return;
>  	}
>  
> @@ -471,6 +480,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * parent.  This attempts to lose the minimal amount of work done while
>  	 * still freeing memory.
>  	 */
> +	read_lock(&tasklist_lock);
>  	do {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
> @@ -483,15 +493,26 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			child_points = oom_badness(child, memcg, nodemask,
>  								totalpages);
>  			if (child_points > victim_points) {
> +				put_task_struct(victim);
>  				victim = child;
>  				victim_points = child_points;
> +				get_task_struct(victim);
>  			}
>  		}
>  	} while_each_thread(p, t);
> +	read_unlock(&tasklist_lock);
>  
> -	victim = find_lock_task_mm(victim);
> -	if (!victim)
> +	rcu_read_lock();
> +	p = find_lock_task_mm(victim);
> +	if (!p) {
> +		rcu_read_unlock();
> +		put_task_struct(victim);
>  		return;
> +	} else if (victim != p) {
> +		get_task_struct(p);
> +		put_task_struct(victim);
> +		victim = p;
> +	}
>  
>  	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
> @@ -522,9 +543,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			task_unlock(p);
>  			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  		}
> +	rcu_read_unlock();
>  
>  	set_tsk_thread_flag(victim, TIF_MEMDIE);
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	put_task_struct(victim);

It seems to me we can avoid this get/put dance in oom_kill_process();
you just need to extend the rcu-protected area. In this case the caller
of select_bad_process() does a single put_task_struct(), and the
sysctl_oom_kill_allocating_task path doesn't need get_task_struct(current).
Looks cleaner/simpler to me.
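
Something along these lines (an untested sketch of the idea;
select_bad_process() still takes the reference, the caller drops it once):

	rcu_read_lock();
	p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
	if (p && PTR_ERR(p) != -1UL) {
		/* no get/put dance inside oom_kill_process() */
		oom_kill_process(p, gfp_mask, order, points, totalpages,
				 NULL, nodemask, "Out of memory");
		put_task_struct(p);	/* single put by the caller */
	}
	rcu_read_unlock();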

However. This is subjective, and I see nothing wrong in this patch,
I won't insist.

Looks correct.

Oleg.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 5/5] mm, memcg: move all oom handling to memcontrol.c
@ 2012-07-04  5:51       ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 66+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-04  5:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, Oleg Nesterov, linux-mm, cgroups

(2012/06/30 6:07), David Rientjes wrote:
> By globally defining check_panic_on_oom(), the memcg oom handler can be
> moved entirely to mm/memcontrol.c.  This removes the ugly #ifdef in the
> oom killer and cleans up the code.
>
> Signed-off-by: David Rientjes <rientjes@google.com>

Thank you.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 4/5] mm, oom: reduce dependency on tasklist_lock
  2012-07-03 18:17       ` Oleg Nesterov
  (?)
@ 2012-07-10 21:04       ` David Rientjes
  -1 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-07-10 21:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, linux-mm, cgroups

On Tue, 3 Jul 2012, Oleg Nesterov wrote:

> >  		      unsigned int points, unsigned long totalpages,
> >  		      struct mem_cgroup *memcg, nodemask_t *nodemask,
> > @@ -454,6 +462,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  	 */
> >  	if (p->flags & PF_EXITING) {
> >  		set_tsk_thread_flag(p, TIF_MEMDIE);
> > +		put_task_struct(p);
> >  		return;
> >  	}
> >  
> > @@ -471,6 +480,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  	 * parent.  This attempts to lose the minimal amount of work done while
> >  	 * still freeing memory.
> >  	 */
> > +	read_lock(&tasklist_lock);
> >  	do {
> >  		list_for_each_entry(child, &t->children, sibling) {
> >  			unsigned int child_points;
> > @@ -483,15 +493,26 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  			child_points = oom_badness(child, memcg, nodemask,
> >  								totalpages);
> >  			if (child_points > victim_points) {
> > +				put_task_struct(victim);
> >  				victim = child;
> >  				victim_points = child_points;
> > +				get_task_struct(victim);
> >  			}
> >  		}
> >  	} while_each_thread(p, t);
> > +	read_unlock(&tasklist_lock);
> >  
> > -	victim = find_lock_task_mm(victim);
> > -	if (!victim)
> > +	rcu_read_lock();
> > +	p = find_lock_task_mm(victim);
> > +	if (!p) {
> > +		rcu_read_unlock();
> > +		put_task_struct(victim);
> >  		return;
> > +	} else if (victim != p) {
> > +		get_task_struct(p);
> > +		put_task_struct(victim);
> > +		victim = p;
> > +	}
> >  
> >  	/* mm cannot safely be dereferenced after task_unlock(victim) */
> >  	mm = victim->mm;
> > @@ -522,9 +543,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  			task_unlock(p);
> >  			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> >  		}
> > +	rcu_read_unlock();
> >  
> >  	set_tsk_thread_flag(victim, TIF_MEMDIE);
> >  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> > +	put_task_struct(victim);
> 
> It seems to me we can avoid this get/put dance in oom_kill_process(),
> just you need to extend the rcu-protected area. In this case the caller
> of select_bad_process() does a single put_, and
> sysctl_oom_kill_allocating_task doesn't need get_task_struct(current).
> Look more clean/simple to me.
> 

We could grab rcu_read_lock() before the first tasklist scan and hold it
until a process is killed, yes, but there's a higher likelihood that it
will never be dropped for concurrent oom kills, in the same way that the
write side of tasklist_lock is currently starved.  On a system with a
large number of cpus this isn't even a rare situation to run into: the
read lock would never be dropped on all cpus.  I've attempted to make the
locking as fine-grained as possible, holding it only when absolutely
required, and to use task references to keep the selected threads around
until they are killed.

Let me know if you have a better solution to rcu read lock starvation.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 1/5] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h
@ 2012-07-10 21:05     ` David Rientjes
  0 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-07-10 21:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On Fri, 29 Jun 2012, David Rientjes wrote:

> mem_cgroup_out_of_memory() is defined in mm/oom_kill.c, so declare it in
> linux/oom.h rather than linux/memcontrol.h.
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: David Rientjes <rientjes@google.com>

Ping on this patchset?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/5] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-07-10 21:19       ` Andrew Morton
  0 siblings, 0 replies; 66+ messages in thread
From: Andrew Morton @ 2012-07-10 21:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On Fri, 29 Jun 2012 14:06:56 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> The global oom killer is serialized by the zonelist being used in the
> page allocation.

Brain hurts.  Presumably this is referring to some lock within the
zonelist.  Clarify, please?

>  Concurrent oom kills are thus a rare event and only
> occur in systems using mempolicies and with a large number of nodes.
> 
> Memory controller oom kills, however, can frequently be concurrent since
> there is no serialization once the oom killer is called for oom
> conditions in several different memcgs in parallel.
> 
> This creates a massive contention on tasklist_lock since the oom killer
> requires the readside for the tasklist iteration.  If several memcgs are
> calling the oom killer, this lock can be held for a substantial amount of
> time, especially if threads continue to enter it as other threads are
> exiting.
> 
> Since the exit path grabs the writeside of the lock with irqs disabled in
> a few different places, this can cause a soft lockup on cpus as a result
> of tasklist_lock starvation.
> 
> The kernel lacks unfair writelocks, and successful calls to the oom
> killer usually result in at least one thread entering the exit path, so
> an alternative solution is needed.
> 
> This patch introduces a separate oom handler for memcgs so that they do
> not require tasklist_lock for as much time.  Instead, it iterates only
> over the threads attached to the oom memcg and grabs a reference to the
> selected thread before calling oom_kill_process() to ensure it doesn't
> prematurely exit.
> 
> This still requires tasklist_lock for the tasklist dump, iterating
> children of the selected process, and killing all other threads on the
> system sharing the same memory as the selected victim.  So while this
> isn't a complete solution to tasklist_lock starvation, it significantly
> reduces the amount of time that it is held.
> 
>
> ...
>
> @@ -1469,6 +1469,65 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +				int order)

Perhaps have a comment over this function explaining why it exists?

> +{
> +	struct mem_cgroup *iter;
> +	unsigned long chosen_points = 0;
> +	unsigned long totalpages;
> +	unsigned int points = 0;
> +	struct task_struct *chosen = NULL;
> +
> +	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		struct cgroup *cgroup = iter->css.cgroup;
> +		struct cgroup_iter it;
> +		struct task_struct *task;
> +
> +		cgroup_iter_start(cgroup, &it);
> +		while ((task = cgroup_iter_next(cgroup, &it))) {
> +			switch (oom_scan_process_thread(task, totalpages, NULL,
> +							false)) {
> +			case OOM_SCAN_SELECT:
>
> ...
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/5] mm, memcg: introduce own oom handler to iterate only over its own threads
  2012-07-10 21:19       ` Andrew Morton
  (?)
@ 2012-07-10 23:24       ` David Rientjes
  -1 siblings, 0 replies; 66+ messages in thread
From: David Rientjes @ 2012-07-10 23:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On Tue, 10 Jul 2012, Andrew Morton wrote:

> > The global oom killer is serialized by the zonelist being used in the
> > page allocation.
> 
> Brain hurts.  Presumably this is referring to some lock within the
> zonelist.  Clarify, please?
> 

Yeah, it's done with try_set_zonelist_oom() before calling the oom killer;
it sets the ZONE_OOM_LOCKED bit for each zone in the zonelist to avoid
concurrent oom kills for the same zonelist; otherwise it would be possible
to kill more tasks than necessary.
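
Roughly, the allocator side looks like this (simplified sketch of the
flow in mm/page_alloc.c):

	if (!try_set_zonelist_oom(zonelist, gfp_mask))
		return NULL;	/* an oom kill is already in flight here */
	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
	clear_zonelist_oom(zonelist, gfp_mask);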

> >  Concurrent oom kills are thus a rare event and only
> > occur in systems using mempolicies and with a large number of nodes.
> > 
> > Memory controller oom kills, however, can frequently be concurrent since
> > there is no serialization once the oom killer is called for oom
> > conditions in several different memcgs in parallel.
> > 
> > This creates a massive contention on tasklist_lock since the oom killer
> > requires the readside for the tasklist iteration.  If several memcgs are
> > calling the oom killer, this lock can be held for a substantial amount of
> > time, especially if threads continue to enter it as other threads are
> > exiting.
> > 
> > Since the exit path grabs the writeside of the lock with irqs disabled in
> > a few different places, this can cause a soft lockup on cpus as a result
> > of tasklist_lock starvation.
> > 
> > The kernel lacks unfair writelocks, and successful calls to the oom
> > killer usually result in at least one thread entering the exit path, so
> > an alternative solution is needed.
> > 
> > This patch introduces a separate oom handler for memcgs so that they do
> > not require tasklist_lock for as much time.  Instead, it iterates only
> > over the threads attached to the oom memcg and grabs a reference to the
> > selected thread before calling oom_kill_process() to ensure it doesn't
> > prematurely exit.
> > 
> > This still requires tasklist_lock for the tasklist dump, iterating
> > children of the selected process, and killing all other threads on the
> > system sharing the same memory as the selected victim.  So while this
> > isn't a complete solution to tasklist_lock starvation, it significantly
> > reduces the amount of time that it is held.
> > 
> >
> > ...
> >
> > @@ -1469,6 +1469,65 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> >  	return min(limit, memsw);
> >  }
> >  
> > +void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > +				int order)
> 
> Perhaps have a comment over this function explaining why it exists?
> 

It's removed in the last patch in the series, but I can add a comment to
the new mem_cgroup_out_of_memory() explaining why we need to kill a task
when a memcg reaches its limit, if you'd like.
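
Something like this, perhaps (wording suggestion only):

	/*
	 * A memcg has reached its limit (or its memory+swap limit) and
	 * reclaim has failed to make progress, so one of the tasks in the
	 * memcg hierarchy must be killed for the charge to eventually
	 * succeed.
	 */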


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 2/5] mm, oom: introduce helper function to process threads during scan
@ 2012-07-12  7:18       ` Sha Zhengju
  0 siblings, 0 replies; 66+ messages in thread
From: Sha Zhengju @ 2012-07-12  7:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On 06/30/2012 05:06 AM, David Rientjes wrote:
> This patch introduces a helper function to process each thread during the
> iteration over the tasklist.  A new return type, enum oom_scan_t, is
> defined to determine the future behavior of the iteration:
>
>   - OOM_SCAN_OK: continue scanning the thread and find its badness,
>
>   - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
>     ineligible,
>
>   - OOM_SCAN_ABORT: abort the iteration and return, or
>
>   - OOM_SCAN_SELECT: always select this thread with the highest badness
>     possible.
>
> There is no functional change with this patch.  This new helper function
> will be used in the next patch in the memory controller.
>
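
FWIW, for other readers: with this helper the per-thread handling in the
tasklist loop becomes a switch over the new return type, roughly (sketch
simplified from the loop body in select_bad_process()):

	switch (oom_scan_process_thread(p, totalpages, nodemask, force_kill)) {
	case OOM_SCAN_SELECT:
		chosen = p;
		chosen_points = ULONG_MAX;
		/* fall through */
	case OOM_SCAN_CONTINUE:
		continue;	/* skip to the next thread */
	case OOM_SCAN_ABORT:
		return ERR_PTR(-1UL);
	case OOM_SCAN_OK:
		break;
	};
	points = oom_badness(p, NULL, nodemask, totalpages);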

Looks good to me.
You can add  Reviewed-by: Sha Zhengju <handai.szj@taobao.com> :-)


Thanks,
Sha


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/5] mm, memcg: introduce own oom handler to iterate only over its own threads
@ 2012-07-12 14:50       ` Sha Zhengju
  0 siblings, 0 replies; 66+ messages in thread
From: Sha Zhengju @ 2012-07-12 14:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On 06/30/2012 05:06 AM, David Rientjes wrote:
> The global oom killer is serialized by the zonelist being used in the
> page allocation.  Concurrent oom kills are thus a rare event and only
> occur in systems using mempolicies and with a large number of nodes.
>
> Memory controller oom kills, however, can frequently be concurrent since
> there is no serialization once the oom killer is called for oom
> conditions in several different memcgs in parallel.
>
> This creates a massive contention on tasklist_lock since the oom killer
> requires the readside for the tasklist iteration.  If several memcgs are
> calling the oom killer, this lock can be held for a substantial amount of
> time, especially if threads continue to enter it as other threads are
> exiting.
>
> Since the exit path grabs the writeside of the lock with irqs disabled in
> a few different places, this can cause a soft lockup on cpus as a result
> of tasklist_lock starvation.
>
> The kernel lacks unfair writelocks, and successful calls to the oom
> killer usually result in at least one thread entering the exit path, so
> an alternative solution is needed.
>
> This patch introduces a separate oom handler for memcgs so that they do
> not require tasklist_lock for as much time.  Instead, it iterates only
> over the threads attached to the oom memcg and grabs a reference to the
> selected thread before calling oom_kill_process() to ensure it doesn't
> prematurely exit.
>
> This still requires tasklist_lock for the tasklist dump, iterating
> children of the selected process, and killing all other threads on the
> system sharing the same memory as the selected victim.  So while this
> isn't a complete solution to tasklist_lock starvation, it significantly
> reduces the amount of time that it is held.
>

Looks good.
You can add Reviewed-by: Sha Zhengju <handai.szj@taobao.com>

Thanks,
Sha


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 4/5] mm, oom: reduce dependency on tasklist_lock
@ 2012-07-13 14:32       ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2012-07-13 14:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

On Fri 29-06-12 14:06:59, David Rientjes wrote:
> Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
> try to reduce the amount of time the readside is held for oom kills.
> This makes the interface with the memcg oom handler more consistent since
> it now never needs to take tasklist_lock unnecessarily.
> 
> The only time the oom killer now takes tasklist_lock is when iterating
> the children of the selected task, everything else is protected by
> rcu_read_lock().
> 
> This requires that a reference to the selected process, p, is grabbed
> before calling oom_kill_process().  It may release it and grab a
> reference on another one of p's threads if !p->mm, but it also guarantees
> that it will release the reference before returning.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Sorry for the late reply, I didn't get to this one sooner...

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c |    2 --
>  mm/oom_kill.c   |   40 +++++++++++++++++++++++++++++-----------
>  2 files changed, 29 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1521,10 +1521,8 @@ void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!chosen)
>  		return;
>  	points = chosen_points * 1000 / totalpages;
> -	read_lock(&tasklist_lock);
>  	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
>  			 NULL, "Memory cgroup out of memory");
> -	read_unlock(&tasklist_lock);
>  	put_task_struct(chosen);
>  }
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -336,7 +336,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  
>  /*
>   * Simple selection loop. We chose the process with the highest
> - * number of 'points'. We expect the caller will lock the tasklist.
> + * number of 'points'.
>   *
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
> @@ -348,6 +348,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  	struct task_struct *chosen = NULL;
>  	unsigned long chosen_points = 0;
>  
> +	rcu_read_lock();
>  	do_each_thread(g, p) {
>  		unsigned int points;
>  
> @@ -370,6 +371,9 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  			chosen_points = points;
>  		}
>  	} while_each_thread(g, p);
> +	if (chosen)
> +		get_task_struct(chosen);
> +	rcu_read_unlock();
>  
>  	*ppoints = chosen_points * 1000 / totalpages;
>  	return chosen;
> @@ -385,8 +389,6 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>   * are not shown.
>   * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
>   * value, oom_score_adj value, and name.
> - *
> - * Call with tasklist_lock read-locked.
>   */
>  static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask)
>  {
> @@ -394,6 +396,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
>  	struct task_struct *task;
>  
>  	pr_info("[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name\n");
> +	rcu_read_lock();
>  	for_each_process(p) {
>  		if (oom_unkillable_task(p, memcg, nodemask))
>  			continue;
> @@ -415,6 +418,7 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
>  			task->signal->oom_score_adj, task->comm);
>  		task_unlock(task);
>  	}
> +	rcu_read_unlock();
>  }
>  
>  static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> @@ -435,6 +439,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  }
>  
>  #define K(x) ((x) << (PAGE_SHIFT-10))
> +/*
> + * Must be called while holding a reference to p, which will be released upon
> + * returning.
> + */
>  void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		      unsigned int points, unsigned long totalpages,
>  		      struct mem_cgroup *memcg, nodemask_t *nodemask,
> @@ -454,6 +462,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 */
>  	if (p->flags & PF_EXITING) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
> +		put_task_struct(p);
>  		return;
>  	}
>  
> @@ -471,6 +480,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * parent.  This attempts to lose the minimal amount of work done while
>  	 * still freeing memory.
>  	 */
> +	read_lock(&tasklist_lock);
>  	do {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
> @@ -483,15 +493,26 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			child_points = oom_badness(child, memcg, nodemask,
>  								totalpages);
>  			if (child_points > victim_points) {
> +				put_task_struct(victim);
>  				victim = child;
>  				victim_points = child_points;
> +				get_task_struct(victim);
>  			}
>  		}
>  	} while_each_thread(p, t);
> +	read_unlock(&tasklist_lock);
>  
> -	victim = find_lock_task_mm(victim);
> -	if (!victim)
> +	rcu_read_lock();
> +	p = find_lock_task_mm(victim);
> +	if (!p) {
> +		rcu_read_unlock();
> +		put_task_struct(victim);
>  		return;
> +	} else if (victim != p) {
> +		get_task_struct(p);
> +		put_task_struct(victim);
> +		victim = p;
> +	}
>  
>  	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
> @@ -522,9 +543,11 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			task_unlock(p);
>  			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  		}
> +	rcu_read_unlock();
>  
>  	set_tsk_thread_flag(victim, TIF_MEMDIE);
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	put_task_struct(victim);
>  }
>  #undef K
>  
> @@ -545,9 +568,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
>  		if (constraint != CONSTRAINT_NONE)
>  			return;
>  	}
> -	read_lock(&tasklist_lock);
>  	dump_header(NULL, gfp_mask, order, NULL, nodemask);
> -	read_unlock(&tasklist_lock);
>  	panic("Out of memory: %s panic_on_oom is enabled\n",
>  		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
>  }
> @@ -720,10 +741,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
>  	check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
>  
> -	read_lock(&tasklist_lock);
>  	if (sysctl_oom_kill_allocating_task &&
>  	    !oom_unkillable_task(current, NULL, nodemask) &&
>  	    current->mm) {
> +		get_task_struct(current);
>  		oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL,
>  				 nodemask,
>  				 "Out of memory (oom_kill_allocating_task)");
> @@ -734,7 +755,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
>  	if (!p) {
>  		dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
> -		read_unlock(&tasklist_lock);
>  		panic("Out of memory and no killable processes...\n");
>  	}
>  	if (PTR_ERR(p) != -1UL) {
> @@ -743,8 +763,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		killed = 1;
>  	}
>  out:
> -	read_unlock(&tasklist_lock);
> -
>  	/*
>  	 * Give the killed threads a good chance of exiting before trying to
>  	 * allocate memory again.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 66+ messages in thread
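
The hunks above establish a simple ownership rule for the task reference.
As a minimal illustrative sketch (not code from the patch; gfp_mask, order,
totalpages and nodemask stand in for whatever the real call site has in
scope, while get_task_struct()/put_task_struct() are the actual helpers):

	/*
	 * A task passed to oom_kill_process() must be pinned by the
	 * caller; oom_kill_process() unpins it on every return path,
	 * including the early PF_EXITING return.
	 */
	get_task_struct(p);			/* caller takes the reference */
	oom_kill_process(p, gfp_mask, order, 0, totalpages, NULL,
			 nodemask, "sketch");	/* callee always drops it */
	/* p may already have been freed here: no further use, no second put */

This is why the selected victim can safely be used outside rcu_read_lock():
the pin taken at selection time is handed off to oom_kill_process().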

* Re: [patch 5/5] mm, memcg: move all oom handling to memcontrol.c
  2012-06-29 21:07     ` David Rientjes
  (?)
  (?)
@ 2012-07-13 14:34     ` Michal Hocko
  -1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2012-07-13 14:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner,
	KOSAKI Motohiro, Minchan Kim, Oleg Nesterov, linux-mm, cgroups

[Sorry for the late reply]

On Fri 29-06-12 14:07:01, David Rientjes wrote:
> By globally defining check_panic_on_oom(), the memcg oom handler can be
> moved entirely to mm/memcontrol.c.  This removes the ugly #ifdef in the
> oom killer and cleans up the code.
> 
> Signed-off-by: David Rientjes <rientjes@google.com> 

Yes, I like it.
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |    2 --
>  include/linux/oom.h        |    3 +++
>  mm/memcontrol.c            |   15 +++++++++++++--
>  mm/oom_kill.c              |   23 ++---------------------
>  4 files changed, 18 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -180,8 +180,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask,
>  						unsigned long *total_scanned);
> -extern void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -				       int order);
>  
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -61,6 +61,9 @@ extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>  extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>  
> +extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
> +			       int order, const nodemask_t *nodemask);
> +
>  extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  		unsigned long totalpages, const nodemask_t *nodemask,
>  		bool force_kill);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1469,8 +1469,8 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> -void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -				int order)
> +void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +			      int order)
>  {
>  	struct mem_cgroup *iter;
>  	unsigned long chosen_points = 0;
> @@ -1478,6 +1478,17 @@ void __mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned int points = 0;
>  	struct task_struct *chosen = NULL;
>  
> +	/*
> +	 * If current has a pending SIGKILL, then automatically select it.  The
> +	 * goal is to allow it to allocate so that it may quickly exit and free
> +	 * its memory.
> +	 */
> +	if (fatal_signal_pending(current)) {
> +		set_thread_flag(TIF_MEMDIE);
> +		return;
> +	}
> +
> +	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
>  	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct cgroup *cgroup = iter->css.cgroup;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -554,8 +554,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  /*
>   * Determines whether the kernel must panic because of the panic_on_oom sysctl.
>   */
> -static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
> -				int order, const nodemask_t *nodemask)
> +void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
> +			int order, const nodemask_t *nodemask)
>  {
>  	if (likely(!sysctl_panic_on_oom))
>  		return;
> @@ -573,25 +573,6 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
>  		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
>  }
>  
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -			      int order)
> -{
> -	/*
> -	 * If current has a pending SIGKILL, then automatically select it.  The
> -	 * goal is to allow it to allocate so that it may quickly exit and free
> -	 * its memory.
> -	 */
> -	if (fatal_signal_pending(current)) {
> -		set_thread_flag(TIF_MEMDIE);
> -		return;
> -	}
> -
> -	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
> -	__mem_cgroup_out_of_memory(memcg, gfp_mask, order);
> -}
> -#endif
> -
>  static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
>  
>  int register_oom_notifier(struct notifier_block *nb)

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 66+ messages in thread
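
For background on the now-global check_panic_on_oom(): the sysctl semantics
it encodes, condensed from the hunks above into a summary comment (not new
kernel code), are:

	/*
	 * vm.panic_on_oom as enforced by check_panic_on_oom():
	 *   0 - never panic; kill a task instead (the default)
	 *   1 - panic only on a system-wide OOM; a constrained OOM
	 *       (constraint != CONSTRAINT_NONE: memcg, cpuset or
	 *       mempolicy) returns without panicking
	 *   2 - "compulsory": always panic, which is why the memcg
	 *       path must still call it with CONSTRAINT_MEMCG
	 */

Exporting the helper rather than duplicating this logic in memcontrol.c is
what lets the #ifdef CONFIG_CGROUP_MEM_RES_CTLR block disappear entirely.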

* [PATCH mmotm] mm, oom: reduce dependency on tasklist_lock: fix
@ 2012-07-16  7:42         ` Hugh Dickins
  0 siblings, 0 replies; 66+ messages in thread
From: Hugh Dickins @ 2012-07-16  7:42 UTC (permalink / raw)
  To: Andrew Morton, David Rientjes
  Cc: KAMEZAWA Hiroyuki, Johannes Weiner, KOSAKI Motohiro, Minchan Kim,
	Oleg Nesterov, Michal Hocko, linux-mm, cgroups

Slab poisoning gave me a General Protection Fault on the
	atomic_dec(&__task_cred(p)->user->processes);
line of release_task() called from wait_task_zombie(),
every time my dd to USB testing generated a memcg OOM.

oom_kill_process() now does the put_task_struct();
mem_cgroup_out_of_memory() should not repeat it.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/memcontrol.c |    1 -
 1 file changed, 1 deletion(-)

--- mmotm/mm/memcontrol.c	2012-07-11 14:50:29.808349013 -0700
+++ linux/mm/memcontrol.c	2012-07-15 12:21:26.234289161 -0700
@@ -1533,7 +1533,6 @@ void mem_cgroup_out_of_memory(struct mem
 	points = chosen_points * 1000 / totalpages;
 	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
 			 NULL, "Memory cgroup out of memory");
-	put_task_struct(chosen);
 }
 
 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,

^ permalink raw reply	[flat|nested] 66+ messages in thread
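
To make the failure mode concrete: annotated against the diff above, the
pre-fix sequence performed two drops for the single reference taken when
the victim was selected (comments added here, not in the source):

	/* chosen was pinned with get_task_struct() at selection time */
	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
			 NULL, "Memory cgroup out of memory");
					/* drop #1: callee releases the pin */
	put_task_struct(chosen);	/* drop #2: usage count underflows   */

With the usage count reaching zero while the parent is still reaping the
zombie, the task_struct is freed early; slab poisoning then turns
release_task()'s __task_cred(p)->user dereference into the GPF reported
above.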

* Re: [PATCH mmotm] mm, oom: reduce dependency on tasklist_lock: fix
@ 2012-07-16  8:06           ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2012-07-16  8:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki,
	Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Oleg Nesterov,
	linux-mm, cgroups

On Mon 16-07-12 00:42:37, Hugh Dickins wrote:
> Slab poisoning gave me a General Protection Fault on the
> 	atomic_dec(&__task_cred(p)->user->processes);
> line of release_task() called from wait_task_zombie(),
> every time my dd to USB testing generated a memcg OOM.

Just curious, was it with the wait-on-pagereclaim patch?

> oom_kill_process() now does the put_task_struct();
> mem_cgroup_out_of_memory() should not repeat it.

Good catch. I missed that during review - my bad...

> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
> 
>  mm/memcontrol.c |    1 -
>  1 file changed, 1 deletion(-)
> 
> --- mmotm/mm/memcontrol.c	2012-07-11 14:50:29.808349013 -0700
> +++ linux/mm/memcontrol.c	2012-07-15 12:21:26.234289161 -0700
> @@ -1533,7 +1533,6 @@ void mem_cgroup_out_of_memory(struct mem
>  	points = chosen_points * 1000 / totalpages;
>  	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
>  			 NULL, "Memory cgroup out of memory");
> -	put_task_struct(chosen);
>  }
>  
>  static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH mmotm] mm, oom: reduce dependency on tasklist_lock: fix
  2012-07-16  8:06           ` Michal Hocko
  (?)
@ 2012-07-16  9:01           ` Hugh Dickins
  2012-07-16  9:27             ` Michal Hocko
  -1 siblings, 1 reply; 66+ messages in thread
From: Hugh Dickins @ 2012-07-16  9:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki,
	Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Oleg Nesterov,
	linux-mm, cgroups

On Mon, 16 Jul 2012, Michal Hocko wrote:
> On Mon 16-07-12 00:42:37, Hugh Dickins wrote:
> > Slab poisoning gave me a General Protection Fault on the
> > 	atomic_dec(&__task_cred(p)->user->processes);
> > line of release_task() called from wait_task_zombie(),
> > every time my dd to USB testing generated a memcg OOM.
> 
> Just curious, was it with the wait-on-pagereclaim patch?

Yes, that's what I was trying to test.

Hugh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH mmotm] mm, oom: reduce dependency on tasklist_lock: fix
  2012-07-16  9:01           ` Hugh Dickins
@ 2012-07-16  9:27             ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2012-07-16  9:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, KAMEZAWA Hiroyuki,
	Johannes Weiner, KOSAKI Motohiro, Minchan Kim, Oleg Nesterov,
	linux-mm, cgroups

On Mon 16-07-12 02:01:55, Hugh Dickins wrote:
> On Mon, 16 Jul 2012, Michal Hocko wrote:
> > On Mon 16-07-12 00:42:37, Hugh Dickins wrote:
> > > Slab poisoning gave me a General Protection Fault on the
> > > 	atomic_dec(&__task_cred(p)->user->processes);
> > > line of release_task() called from wait_task_zombie(),
> > > every time my dd to USB testing generated a memcg OOM.
> > 
> > Just curious, was it with the wait-on-pagereclaim patch?
> 
> Yes, that's what I was trying to test.

I noticed the other email already. Sorry for the noise.

> 
> Hugh

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH mmotm] mm, oom: reduce dependency on tasklist_lock: fix
@ 2012-07-19 10:11           ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 66+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-19 10:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, Johannes Weiner, KOSAKI Motohiro,
	Minchan Kim, Oleg Nesterov, Michal Hocko, linux-mm, cgroups

(2012/07/16 16:42), Hugh Dickins wrote:
> Slab poisoning gave me a General Protection Fault on the
> 	atomic_dec(&__task_cred(p)->user->processes);
> line of release_task() called from wait_task_zombie(),
> every time my dd to USB testing generated a memcg OOM.
>
> oom_kill_process() now does the put_task_struct();
> mem_cgroup_out_of_memory() should not repeat it.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Thank you for catching this!

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2012-07-19 10:13 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-26  1:47 [patch 1/3] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h David Rientjes
2012-06-26  1:47 ` David Rientjes
2012-06-26  1:47 ` [rfc][patch 2/3] mm, oom: introduce helper function to process threads during scan David Rientjes
2012-06-26  1:47   ` David Rientjes
2012-06-26  3:22   ` Kamezawa Hiroyuki
2012-06-26  3:22     ` Kamezawa Hiroyuki
2012-06-26  6:05     ` KOSAKI Motohiro
2012-06-26  8:48   ` Michal Hocko
2012-06-26  8:48     ` Michal Hocko
2012-06-26  1:47 ` [rfc][patch 3/3] mm, memcg: introduce own oom handler to iterate only over its own threads David Rientjes
2012-06-26  1:47   ` David Rientjes
2012-06-26  5:32   ` Kamezawa Hiroyuki
2012-06-26  5:32     ` Kamezawa Hiroyuki
2012-06-26 20:38     ` David Rientjes
2012-06-27  5:35       ` David Rientjes
2012-06-27  5:35         ` David Rientjes
2012-06-28  1:43         ` David Rientjes
2012-06-28  1:43           ` David Rientjes
2012-06-28 17:16           ` Oleg Nesterov
2012-06-29 20:37             ` David Rientjes
2012-06-28  8:55         ` Kamezawa Hiroyuki
2012-06-28  8:55           ` Kamezawa Hiroyuki
2012-06-29 20:30           ` David Rientjes
2012-06-29 20:30             ` David Rientjes
2012-07-03 17:56             ` Oleg Nesterov
2012-06-28  8:52       ` Kamezawa Hiroyuki
2012-06-26  9:58   ` Michal Hocko
2012-06-26  9:58     ` Michal Hocko
2012-06-26  3:12 ` [patch 1/3] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h Kamezawa Hiroyuki
2012-06-26  6:04   ` KOSAKI Motohiro
2012-06-26  6:04     ` KOSAKI Motohiro
2012-06-26  8:34 ` Michal Hocko
2012-06-26  8:34   ` Michal Hocko
2012-06-29 21:06 ` [patch 1/5] " David Rientjes
2012-06-29 21:06   ` [patch 2/5] mm, oom: introduce helper function to process threads during scan David Rientjes
2012-06-29 21:06     ` David Rientjes
2012-07-12  7:18     ` Sha Zhengju
2012-07-12  7:18       ` Sha Zhengju
2012-06-29 21:06   ` [patch 3/5] mm, memcg: introduce own oom handler to iterate only over its own threads David Rientjes
2012-06-29 21:06     ` David Rientjes
2012-07-10 21:19     ` Andrew Morton
2012-07-10 21:19       ` Andrew Morton
2012-07-10 23:24       ` David Rientjes
2012-07-12 14:50     ` Sha Zhengju
2012-07-12 14:50       ` Sha Zhengju
2012-06-29 21:06   ` [patch 4/5] mm, oom: reduce dependency on tasklist_lock David Rientjes
2012-07-03 18:17     ` Oleg Nesterov
2012-07-03 18:17       ` Oleg Nesterov
2012-07-10 21:04       ` David Rientjes
2012-07-13 14:32     ` Michal Hocko
2012-07-13 14:32       ` Michal Hocko
2012-07-16  7:42       ` [PATCH mmotm] mm, oom: reduce dependency on tasklist_lock: fix Hugh Dickins
2012-07-16  7:42         ` Hugh Dickins
2012-07-16  8:06         ` Michal Hocko
2012-07-16  8:06           ` Michal Hocko
2012-07-16  9:01           ` Hugh Dickins
2012-07-16  9:27             ` Michal Hocko
2012-07-19 10:11         ` Kamezawa Hiroyuki
2012-07-19 10:11           ` Kamezawa Hiroyuki
2012-06-29 21:07   ` [patch 5/5] mm, memcg: move all oom handling to memcontrol.c David Rientjes
2012-06-29 21:07     ` David Rientjes
2012-07-04  5:51     ` Kamezawa Hiroyuki
2012-07-04  5:51       ` Kamezawa Hiroyuki
2012-07-13 14:34     ` Michal Hocko
2012-07-10 21:05   ` [patch 1/5] mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h David Rientjes
2012-07-10 21:05     ` David Rientjes
