linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] Change OOM killer to use list of mm_struct.
@ 2016-07-07 15:58 Tetsuo Handa
  2016-07-07 16:00 ` [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-07 15:58 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

This series is an update of
http://lkml.kernel.org/r/201607031135.AAH95347.MVOHQtFJFLOOFS@I-love.SAKURA.ne.jp .

This series is based on top of linux-next-20160707 +
http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org .

 include/linux/mm_types.h |    7 +
 include/linux/oom.h      |   14 --
 include/linux/sched.h    |    2
 kernel/exit.c            |    2
 kernel/fork.c            |    4
 mm/memcontrol.c          |   14 --
 mm/oom_kill.c            |  297 ++++++++++++++++++-----------------------------
 7 files changed, 140 insertions(+), 200 deletions(-)

[PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage.
[PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice.
[PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims.
[PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
[PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims.
[PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread.

This series does not include patches for use_mm() users and wait_event()
in oom_killer_disable(). We can apply
http://lkml.kernel.org/r/1467365190-24640-3-git-send-email-mhocko@kernel.org
on top of this series.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage.
  2016-07-07 15:58 [PATCH v2 0/6] Change OOM killer to use list of mm_struct Tetsuo Handa
@ 2016-07-07 16:00 ` Tetsuo Handa
  2016-07-11 12:02   ` Michal Hocko
  2016-07-07 16:01 ` [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice Tetsuo Handa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-07 16:00 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

>From 70de3fe92435095b6ecbb400c61e84a99f639d56 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 8 Jul 2016 00:28:12 +0900
Subject: [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage.

Since holding mm_struct with elevated mm_count for a second is harmless,
we can determine mm_struct and hold it upon entry of oom_reap_task().
This patch has no functional change. Future patch in this series will
eliminate find_lock_task_mm() usage from the OOM reaper.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 79 ++++++++++++++++++++++++++++-------------------------------
 1 file changed, 37 insertions(+), 42 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275..951eb1b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -452,12 +452,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
@@ -478,22 +476,9 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	mutex_lock(&oom_lock);
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	/*
@@ -503,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -551,8 +536,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
@@ -562,36 +545,45 @@ unlock_oom:
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
+	struct mm_struct *mm = NULL;
+	struct task_struct *p = find_lock_task_mm(tsk);
+
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might have race with exit path so consider our work done if there
+	 * is no mm.
+	 */
+	if (!p)
+		goto done;
+	mm = p->mm;
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+	if (attempts <= MAX_OOM_REAP_RETRIES)
+		goto done;
 
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
+	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+		task_pid_nr(tsk), tsk->comm);
 
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
-		}
-
-		debug_show_all_locks();
+	/*
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
+	 */
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		set_bit(MMF_OOM_REAPED, &mm->flags);
 	}
+	debug_show_all_locks();
 
+done:
 	/*
 	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
@@ -603,6 +595,9 @@ static void oom_reap_task(struct task_struct *tsk)
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
+	/* Drop a reference taken above. */
+	if (mm)
+		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice.
  2016-07-07 15:58 [PATCH v2 0/6] Change OOM killer to use list of mm_struct Tetsuo Handa
  2016-07-07 16:00 ` [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
@ 2016-07-07 16:01 ` Tetsuo Handa
  2016-07-11 12:15   ` Michal Hocko
  2016-07-07 16:03 ` [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-07 16:01 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

>From 50b3a862b136c783be6ce25e0f22446f15a0ab03 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 8 Jul 2016 00:28:50 +0900
Subject: [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice.

"mm, oom_reaper: do not attempt to reap a task twice" tried to give
OOM reaper one more chance to retry. But since OOM killer or one of
threads sharing that mm will queue that mm immediately, retrying
MMF_OOM_NOT_REAPABLE mm is unlikely helpful.

Let's always set MMF_OOM_REAPED.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h |  1 -
 mm/oom_kill.c         | 15 +++------------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..c0efd80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -523,7 +523,6 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
-#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 951eb1b..9f0022e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -567,20 +567,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
+	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
+	set_bit(MMF_OOM_REAPED, &mm->flags);
+
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
-
-	/*
-	 * If we've already tried to reap this task in the past and
-	 * failed it probably doesn't make much sense to try yet again
-	 * so hide the mm from the oom killer so that it can move on
-	 * to another task with a different mm struct.
-	 */
-	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
-		pr_info("oom_reaper: giving up pid:%d (%s)\n",
-			task_pid_nr(tsk), tsk->comm);
-		set_bit(MMF_OOM_REAPED, &mm->flags);
-	}
 	debug_show_all_locks();
 
 done:
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-07 15:58 [PATCH v2 0/6] Change OOM killer to use list of mm_struct Tetsuo Handa
  2016-07-07 16:00 ` [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
  2016-07-07 16:01 ` [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice Tetsuo Handa
@ 2016-07-07 16:03 ` Tetsuo Handa
  2016-07-11 12:50   ` Michal Hocko
  2016-07-07 16:04 ` [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-07 16:03 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

>From 5fbd16cffd5dc51f9ba8591fc18d315ff6ff9b96 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 8 Jul 2016 00:33:13 +0900
Subject: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims.

Currently, we walk process list in order to find existing TIF_MEMDIE
threads. But if we remember list of mm_struct used by TIF_MEMDIE threads,
we can avoid walking process list. Next patch in this series allows
OOM reaper to use list of mm_struct introduced by this patch.

This patch reverts commit e2fe14564d3316d1 ("oom_reaper: close race with
exiting task") because oom_has_pending_mm() will prevent that race.

Since CONFIG_MMU=y kernel has OOM reaper callback hook which can remove
mm_struct from the list, let the OOM reaper call exit_oom_mm(mm). This
patch temporarily fails to call exit_oom_mm(mm) when find_lock_task_mm()
in oom_reap_task() failed. It will be fixed by next patch.

But since CONFIG_MMU=n kernel does not have OOM reaper callback hook,
call exit_oom_mm(mm) from __mmput(mm) if that mm is used by OOM victims.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/mm_types.h |  7 +++++
 include/linux/oom.h      |  3 ++
 kernel/fork.c            |  4 +++
 mm/memcontrol.c          |  5 ++++
 mm/oom_kill.c            | 72 +++++++++++++++++++++++++++++++-----------------
 5 files changed, 66 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d..7c1370a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -392,6 +392,12 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+struct oom_mm {
+	struct list_head list; /* Linked to oom_mm_list list. */
+	/* Thread which was passed to mark_oom_victim() for the last time. */
+	struct task_struct *victim;
+};
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
@@ -515,6 +521,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+	struct oom_mm oom_mm;
 #ifdef CONFIG_MMU
 	struct work_struct async_put_work;
 #endif
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457..bdcb331 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,6 +91,9 @@ extern void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 extern void check_panic_on_oom(struct oom_control *oc,
 			       enum oom_constraint constraint);
 
+extern void exit_oom_mm(struct mm_struct *mm);
+extern bool oom_has_pending_mm(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask);
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 					       struct task_struct *task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 7926993..8e469e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+#ifndef CONFIG_MMU
+	if (mm->oom_mm.victim)
+		exit_oom_mm(mm);
+#endif
 	mmdrop(mm);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40dfca3..8f7a5b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1241,6 +1241,11 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	}
 
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG);
+	if (oom_has_pending_mm(memcg, NULL)) {
+		/* Set a dummy value to return "true". */
+		chosen = (void *) 1;
+		goto unlock;
+	}
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9f0022e..87e7ff3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -275,6 +275,28 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+static LIST_HEAD(oom_mm_list);
+
+void exit_oom_mm(struct mm_struct *mm)
+{
+	mutex_lock(&oom_lock);
+	list_del(&mm->oom_mm.list);
+	put_task_struct(mm->oom_mm.victim);
+	mm->oom_mm.victim = NULL;
+	mmdrop(mm);
+	mutex_unlock(&oom_lock);
+}
+
+bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
+{
+	struct mm_struct *mm;
+
+	list_for_each_entry(mm, &oom_mm_list, oom_mm.list)
+		if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
+			return true;
+	return false;
+}
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 					struct task_struct *task)
 {
@@ -458,28 +480,9 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
-	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto unlock_oom;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * increase mm_users only after we know we will reap something so
@@ -488,7 +491,7 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto unlock_oom;
+		return true;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -536,9 +539,7 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -586,6 +587,9 @@ done:
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
+	/* Drop references taken by mark_oom_victim() */
+	if (mm)
+		exit_oom_mm(mm);
 	/* Drop a reference taken above. */
 	if (mm)
 		mmdrop(mm);
@@ -653,6 +657,9 @@ subsys_initcall(oom_init)
  */
 void mark_oom_victim(struct task_struct *tsk)
 {
+	struct mm_struct *mm = tsk->mm;
+	struct task_struct *old_tsk = mm->oom_mm.victim;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
@@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	/*
+	 * Since mark_oom_victim() is called from multiple threads,
+	 * connect this mm to oom_mm_list only if not yet connected.
+	 */
+	get_task_struct(tsk);
+	mm->oom_mm.victim = tsk;
+	if (!old_tsk) {
+		atomic_inc(&mm->mm_count);
+		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
+	} else {
+		put_task_struct(old_tsk);
+	}
 }
 
 /**
@@ -1026,6 +1045,9 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (!is_sysrq_oom(oc) && oom_has_pending_mm(oc->memcg, oc->nodemask))
+		return true;
+
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-07 15:58 [PATCH v2 0/6] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (2 preceding siblings ...)
  2016-07-07 16:03 ` [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
@ 2016-07-07 16:04 ` Tetsuo Handa
  2016-07-11 13:16   ` Michal Hocko
  2016-07-07 16:06 ` [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims Tetsuo Handa
  2016-07-07 16:07 ` [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
  5 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-07 16:04 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

>From acc85fdd36452e39bace6aa73b3aaa41bbe776a5 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 8 Jul 2016 00:39:36 +0900
Subject: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.

Since OOM reaping is per mm_struct operation, it is natural to use
list of mm_struct used by OOM victims. By using list of mm_struct,
we can eliminate find_lock_task_mm() usage from the OOM reaper.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |   8 -----
 mm/memcontrol.c     |   1 -
 mm/oom_kill.c       | 100 +++++++++++++++-------------------------------------
 3 files changed, 29 insertions(+), 80 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index bdcb331..cb3f041 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,14 +72,6 @@ static inline bool oom_task_origin(const struct task_struct *p)
 
 extern void mark_oom_victim(struct task_struct *tsk);
 
-#ifdef CONFIG_MMU
-extern void wake_oom_reaper(struct task_struct *tsk);
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
-}
-#endif
-
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8f7a5b7..5043324 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1236,7 +1236,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 87e7ff3..223e1fe 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -471,8 +471,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
-static DEFINE_SPINLOCK(oom_reaper_lock);
 
 static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 {
@@ -543,30 +541,23 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
+static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 {
 	int attempts = 0;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p = find_lock_task_mm(tsk);
 
 	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
+	 * Check MMF_OOM_REAPED in case oom_kill_process() found this mm
+	 * pinned.
 	 */
-	if (!p)
-		goto done;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	if (test_bit(MMF_OOM_REAPED, &mm->flags))
+		return;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES)
-		goto done;
+		return;
 
 	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
@@ -574,25 +565,6 @@ static void oom_reap_task(struct task_struct *tsk)
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();
-
-done:
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
-	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
-	/* Drop references taken by mark_oom_victim() */
-	if (mm)
-		exit_oom_mm(mm);
-	/* Drop a reference taken above. */
-	if (mm)
-		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
@@ -600,41 +572,31 @@ static int oom_reaper(void *unused)
 	set_freezable();
 
 	while (true) {
-		struct task_struct *tsk = NULL;
-
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
-		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		struct mm_struct *mm;
+		struct task_struct *victim;
+
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_mm_list));
+		mutex_lock(&oom_lock);
+		mm = list_first_entry(&oom_mm_list, struct mm_struct,
+				      oom_mm.list);
+		victim = mm->oom_mm.victim;
+		/*
+		 * Take a reference on current victim thread in case
+		 * oom_reap_task() raced with mark_oom_victim() by
+		 * other threads sharing this mm.
+		 */
+		get_task_struct(victim);
+		mutex_unlock(&oom_lock);
+		oom_reap_task(victim, mm);
+		put_task_struct(victim);
+		/* Drop references taken by mark_oom_victim() */
+		exit_oom_mm(mm);
 	}
 
 	return 0;
 }
 
-void wake_oom_reaper(struct task_struct *tsk)
-{
-	if (!oom_reaper_th)
-		return;
-
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
-		return;
-
-	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
-	wake_up(&oom_reaper_wait);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -682,6 +644,9 @@ void mark_oom_victim(struct task_struct *tsk)
 	if (!old_tsk) {
 		atomic_inc(&mm->mm_count);
 		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
+#ifdef CONFIG_MMU
+		wake_up(&oom_reaper_wait);
+#endif
 	} else {
 		put_task_struct(old_tsk);
 	}
@@ -826,7 +791,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -835,7 +799,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_lock(p);
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -925,7 +888,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * memory might be still used. Hide the mm from the oom
 			 * killer to guarantee OOM forward progress.
 			 */
-			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
@@ -936,9 +898,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -1014,7 +973,6 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims.
  2016-07-07 15:58 [PATCH v2 0/6] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (3 preceding siblings ...)
  2016-07-07 16:04 ` [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
@ 2016-07-07 16:06 ` Tetsuo Handa
  2016-07-11 13:19   ` Michal Hocko
  2016-07-07 16:07 ` [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
  5 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-07 16:06 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

>From 2da51204250a2f1127792f98c12436771c07b67d Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 8 Jul 2016 00:41:38 +0900
Subject: [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims.

Since oom_has_pending_mm() controls whether to select next OOM victim,
we no longer need to abort OOM victim selection loop using OOM_SCAN_ABORT
case. Also, since signal_struct->oom_victims was used only for allowing
oom_scan_process_thread() to return OOM_SCAN_ABORT, we no longer need to
use signal_struct->oom_victims.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h   |  1 -
 include/linux/sched.h |  1 -
 mm/memcontrol.c       |  8 --------
 mm/oom_kill.c         | 26 +-------------------------
 4 files changed, 1 insertion(+), 35 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index cb3f041..27be4ba 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,7 +49,6 @@ enum oom_constraint {
 enum oom_scan_t {
 	OOM_SCAN_OK,		/* scan thread and find its badness */
 	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
-	OOM_SCAN_ABORT,		/* abort the iteration and return */
 	OOM_SCAN_SELECT,	/* always select this thread first */
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c0efd80..2f57cf1c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -670,7 +670,6 @@ struct signal_struct {
 	atomic_t		sigcnt;
 	atomic_t		live;
 	int			nr_threads;
-	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
 	struct list_head	thread_head;
 
 	wait_queue_head_t	wait_chldexit;	/* for wait4() */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5043324..6afe1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1262,14 +1262,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				/* fall through */
 			case OOM_SCAN_CONTINUE:
 				continue;
-			case OOM_SCAN_ABORT:
-				css_task_iter_end(&it);
-				mem_cgroup_iter_break(memcg, iter);
-				if (chosen)
-					put_task_struct(chosen);
-				/* Set a dummy value to return "true". */
-				chosen = (void *) 1;
-				goto unlock;
 			case OOM_SCAN_OK:
 				break;
 			};
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 223e1fe..b6b79ae 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -304,25 +304,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		return OOM_SCAN_CONTINUE;
 
 	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_REAPED because chances that it would release
-	 * any memory is quite low.
-	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
-		enum oom_scan_t ret = OOM_SCAN_ABORT;
-
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
-
-		return ret;
-	}
-
-	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
@@ -354,9 +335,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 			/* fall through */
 		case OOM_SCAN_CONTINUE:
 			continue;
-		case OOM_SCAN_ABORT:
-			rcu_read_unlock();
-			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
 			break;
 		};
@@ -626,7 +604,6 @@ void mark_oom_victim(struct task_struct *tsk)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_inc(&tsk->signal->oom_victims);
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -659,7 +636,6 @@ void exit_oom_victim(struct task_struct *tsk)
 {
 	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_dec(&tsk->signal->oom_victims);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -1012,7 +988,7 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (p && p != (void *)-1UL) {
+	if (p) {
 		oom_kill_process(oc, p, points, totalpages, "Out of memory");
 		/*
 		 * Give the killed process a good chance to exit before trying
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread.
  2016-07-07 15:58 [PATCH v2 0/6] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (4 preceding siblings ...)
  2016-07-07 16:06 ` [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims Tetsuo Handa
@ 2016-07-07 16:07 ` Tetsuo Handa
  2016-07-11 13:22   ` Michal Hocko
  5 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-07 16:07 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

>From fefcbf2412e38b006281cf1d20f9e3a2bb76714f Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 8 Jul 2016 00:42:48 +0900
Subject: [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread.

Since oom_has_pending_mm() controls whether to select next OOM victim,
we are no longer clearing TIF_MEMDIE on remote thread.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h | 2 +-
 kernel/exit.c       | 2 +-
 mm/oom_kill.c       | 5 ++---
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 27be4ba..90e98ea 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -90,7 +90,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index 84ae830..1b1dada 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b6b79ae..d120cb1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -632,10 +632,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage.
  2016-07-07 16:00 ` [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
@ 2016-07-11 12:02   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-07-11 12:02 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Fri 08-07-16 01:00:13, Tetsuo Handa wrote:
> >From 70de3fe92435095b6ecbb400c61e84a99f639d56 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 8 Jul 2016 00:28:12 +0900
> Subject: [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage.
> 
> Since holding mm_struct with elevated mm_count for a second is harmless,
> we can determine mm_struct and hold it upon entry of oom_reap_task().
> This patch has no functional change. Future patch in this series will
> eliminate find_lock_task_mm() usage from the OOM reaper.

the changelog is quite poor to be honest. It doesn't explain why this
is really needed. What do you think about the following:
"
__oom_reap_task can be simplified a bit if it received a valid mm from
oom_reap_task which might need it as well. We could drop one
find_lock_task_mm call and also make the __oom_reap_task code flow
easier to follow. Moreover this will make later patch in the series
easier to review. Pinning mm's mm_count for longer time is not really
harmfull because this will not pin much memory. 

This patch doesn't introduce any functional change.
"

> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Other than that the patch looks good to me.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/oom_kill.c | 79 ++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 37 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7d0a275..951eb1b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -452,12 +452,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -static bool __oom_reap_task(struct task_struct *tsk)
> +static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
> -	struct mm_struct *mm = NULL;
> -	struct task_struct *p;
>  	struct zap_details details = {.check_swap_entries = true,
>  				      .ignore_dirty = true};
>  	bool ret = true;
> @@ -478,22 +476,9 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  	 */
>  	mutex_lock(&oom_lock);
>  
> -	/*
> -	 * Make sure we find the associated mm_struct even when the particular
> -	 * thread has already terminated and cleared its mm.
> -	 * We might have race with exit path so consider our work done if there
> -	 * is no mm.
> -	 */
> -	p = find_lock_task_mm(tsk);
> -	if (!p)
> -		goto unlock_oom;
> -	mm = p->mm;
> -	atomic_inc(&mm->mm_count);
> -	task_unlock(p);
> -
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
> -		goto mm_drop;
> +		goto unlock_oom;
>  	}
>  
>  	/*
> @@ -503,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  	 */
>  	if (!mmget_not_zero(mm)) {
>  		up_read(&mm->mmap_sem);
> -		goto mm_drop;
> +		goto unlock_oom;
>  	}
>  
>  	tlb_gather_mmu(&tlb, mm, 0, -1);
> @@ -551,8 +536,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  	 * put the oom_reaper out of the way.
>  	 */
>  	mmput_async(mm);
> -mm_drop:
> -	mmdrop(mm);
>  unlock_oom:
>  	mutex_unlock(&oom_lock);
>  	return ret;
> @@ -562,36 +545,45 @@ unlock_oom:
>  static void oom_reap_task(struct task_struct *tsk)
>  {
>  	int attempts = 0;
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *p = find_lock_task_mm(tsk);
> +
> +	/*
> +	 * Make sure we find the associated mm_struct even when the particular
> +	 * thread has already terminated and cleared its mm.
> +	 * We might have race with exit path so consider our work done if there
> +	 * is no mm.
> +	 */
> +	if (!p)
> +		goto done;
> +	mm = p->mm;
> +	atomic_inc(&mm->mm_count);
> +	task_unlock(p);
>  
>  	/* Retry the down_read_trylock(mmap_sem) a few times */
> -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> +	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
>  
> -	if (attempts > MAX_OOM_REAP_RETRIES) {
> -		struct task_struct *p;
> +	if (attempts <= MAX_OOM_REAP_RETRIES)
> +		goto done;
>  
> -		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -				task_pid_nr(tsk), tsk->comm);
> +	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> +		task_pid_nr(tsk), tsk->comm);
>  
> -		/*
> -		 * If we've already tried to reap this task in the past and
> -		 * failed it probably doesn't make much sense to try yet again
> -		 * so hide the mm from the oom killer so that it can move on
> -		 * to another task with a different mm struct.
> -		 */
> -		p = find_lock_task_mm(tsk);
> -		if (p) {
> -			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
> -				pr_info("oom_reaper: giving up pid:%d (%s)\n",
> -						task_pid_nr(tsk), tsk->comm);
> -				set_bit(MMF_OOM_REAPED, &p->mm->flags);
> -			}
> -			task_unlock(p);
> -		}
> -
> -		debug_show_all_locks();
> +	/*
> +	 * If we've already tried to reap this task in the past and
> +	 * failed it probably doesn't make much sense to try yet again
> +	 * so hide the mm from the oom killer so that it can move on
> +	 * to another task with a different mm struct.
> +	 */
> +	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
> +		pr_info("oom_reaper: giving up pid:%d (%s)\n",
> +			task_pid_nr(tsk), tsk->comm);
> +		set_bit(MMF_OOM_REAPED, &mm->flags);
>  	}
> +	debug_show_all_locks();
>  
> +done:
>  	/*
>  	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
>  	 * reasonably reclaimable memory anymore or it is not a good candidate
> @@ -603,6 +595,9 @@ static void oom_reap_task(struct task_struct *tsk)
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);
> +	/* Drop a reference taken above. */
> +	if (mm)
> +		mmdrop(mm);
>  }
>  
>  static int oom_reaper(void *unused)
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice.
  2016-07-07 16:01 ` [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice Tetsuo Handa
@ 2016-07-11 12:15   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-07-11 12:15 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Fri 08-07-16 01:01:36, Tetsuo Handa wrote:
> >From 50b3a862b136c783be6ce25e0f22446f15a0ab03 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 8 Jul 2016 00:28:50 +0900
> Subject: [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice.
> 
> "mm, oom_reaper: do not attempt to reap a task twice" tried to give
> OOM reaper one more chance to retry. But since OOM killer or one of
> threads sharing that mm will queue that mm immediately, retrying
> MMF_OOM_NOT_REAPABLE mm is unlikely helpful.

I agree that the usefulness of the flag is rather limited and actually
never shown in practice. Considering it was added as a just-in-case
measure it is really hard to reason about. And I believe this should be
stated in the changelog.

> Let's always set MMF_OOM_REAPED.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

That being said, feel free to add my Acked-by after the changelog is
more clear about the motivation.

> ---
>  include/linux/sched.h |  1 -
>  mm/oom_kill.c         | 15 +++------------
>  2 files changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 553af29..c0efd80 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -523,7 +523,6 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_REAPED		21	/* mm has been already reaped */
> -#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 951eb1b..9f0022e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -567,20 +567,11 @@ static void oom_reap_task(struct task_struct *tsk)
>  	if (attempts <= MAX_OOM_REAP_RETRIES)
>  		goto done;
>  
> +	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
> +	set_bit(MMF_OOM_REAPED, &mm->flags);
> +
>  	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  		task_pid_nr(tsk), tsk->comm);
> -
> -	/*
> -	 * If we've already tried to reap this task in the past and
> -	 * failed it probably doesn't make much sense to try yet again
> -	 * so hide the mm from the oom killer so that it can move on
> -	 * to another task with a different mm struct.
> -	 */
> -	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
> -		pr_info("oom_reaper: giving up pid:%d (%s)\n",
> -			task_pid_nr(tsk), tsk->comm);
> -		set_bit(MMF_OOM_REAPED, &mm->flags);
> -	}
>  	debug_show_all_locks();
>  
>  done:
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-07 16:03 ` [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
@ 2016-07-11 12:50   ` Michal Hocko
  2016-07-12  6:00     ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-07-11 12:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Fri 08-07-16 01:03:11, Tetsuo Handa wrote:
> >From 5fbd16cffd5dc51f9ba8591fc18d315ff6ff9b96 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 8 Jul 2016 00:33:13 +0900
> Subject: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims.
> 
> Currently, we walk process list in order to find existing TIF_MEMDIE
> threads. But if we remember list of mm_struct used by TIF_MEMDIE threads,
> we can avoid walking process list. Next patch in this series allows
> OOM reaper to use list of mm_struct introduced by this patch.

I believe the changelog doesn't tell the whole story and it is silent
about many aspects which are not obvious at first sight. E.g. why it is
any better to iterate over all mms rather than existing tasks? This is a
slow path. Also it is quite possible to have thousands of mms linked on
the list because of many memcgs hitting the oom. Sure, highly unlikely,
but it would be better to note that this has been considered being
acceptable with an explanation why.

> This patch reverts commit e2fe14564d3316d1 ("oom_reaper: close race with
> exiting task") because oom_has_pending_mm() will prevent that race.

I guess a worth a separate patch.

> Since CONFIG_MMU=y kernel has OOM reaper callback hook which can remove
> mm_struct from the list, let the OOM reaper call exit_oom_mm(mm). This
> patch temporarily fails to call exit_oom_mm(mm) when find_lock_task_mm()
> in oom_reap_task() failed. It will be fixed by next patch.
> 
> But since CONFIG_MMU=n kernel does not have OOM reaper callback hook,
> call exit_oom_mm(mm) from __mmput(mm) if that mm is used by OOM victims.

I guess referring to the MMU configuration is more confusing than
helpful. The life time on the list is quite straightforward. mm is is
unlinked by exit_oom_mm after the address space has been unmapped from
__mmput or from the oom_reaper when available. This will guarantee that
the mm doesn't block the next oom victim selection after the memory was
reclaimed.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/mm_types.h |  7 +++++
>  include/linux/oom.h      |  3 ++
>  kernel/fork.c            |  4 +++
>  mm/memcontrol.c          |  5 ++++
>  mm/oom_kill.c            | 72 +++++++++++++++++++++++++++++++-----------------
>  5 files changed, 66 insertions(+), 25 deletions(-)
> 
[...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 7926993..8e469e0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm)
>  	}
>  	if (mm->binfmt)
>  		module_put(mm->binfmt->module);
> +#ifndef CONFIG_MMU
> +	if (mm->oom_mm.victim)
> +		exit_oom_mm(mm);
> +#endif

This ifdef is not really needed. There is no reason we should wait for
the oom_reaper to unlink the mm.

>  	mmdrop(mm);
>  }
>  
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9f0022e..87e7ff3 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -275,6 +275,28 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
>  }
>  #endif
>  
> +static LIST_HEAD(oom_mm_list);
> +
> +void exit_oom_mm(struct mm_struct *mm)
> +{
> +	mutex_lock(&oom_lock);
> +	list_del(&mm->oom_mm.list);
> +	put_task_struct(mm->oom_mm.victim);
> +	mm->oom_mm.victim = NULL;
> +	mmdrop(mm);
> +	mutex_unlock(&oom_lock);
> +}
> +
> +bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
> +{
> +	struct mm_struct *mm;
> +
> +	list_for_each_entry(mm, &oom_mm_list, oom_mm.list)
> +		if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
> +			return true;

The condition is quite hard to read. Moreover 2 of 4 conditions are
never true. Wouldn't it be better to do something like the following?

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d120cb103507..df4b2b3ad7d0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -132,6 +132,20 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
 	return oc->order == -1;
 }
 
+static bool task_in_oom_domain(struct task_struct *p,
+		struct mem_cgroup *memcg, const nodemask_t *nodemask)
+{
+	/* When mem_cgroup_out_of_memory() and p is not member of the group */
+	if (memcg && !task_in_mem_cgroup(p, memcg))
+		return false;
+
+	/* p may not have freeable memory in nodemask */
+	if (!has_intersects_mems_allowed(p, nodemask))
+		return false;
+
+	return true;
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask)
@@ -141,12 +155,7 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (p->flags & PF_KTHREAD)
 		return true;
 
-	/* When mem_cgroup_out_of_memory() and p is not member of the group */
-	if (memcg && !task_in_mem_cgroup(p, memcg))
-		return true;
-
-	/* p may not have freeable memory in nodemask */
-	if (!has_intersects_mems_allowed(p, nodemask))
+	if (!task_in_oom_domain(p, memcg, nodemask))
 		return true;
 
 	return false;
@@ -292,7 +301,7 @@ bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 	struct mm_struct *mm;
 
 	list_for_each_entry(mm, &oom_mm_list, oom_mm.list)
-		if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
+		if (task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask))
 			return true;
 	return false;
 }

[...]
> @@ -653,6 +657,9 @@ subsys_initcall(oom_init)
>   */
>  void mark_oom_victim(struct task_struct *tsk)
>  {
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *old_tsk = mm->oom_mm.victim;
> +
>  	WARN_ON(oom_killer_disabled);
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> @@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk)
>  	 */
>  	__thaw_task(tsk);
>  	atomic_inc(&oom_victims);
> +	/*
> +	 * Since mark_oom_victim() is called from multiple threads,
> +	 * connect this mm to oom_mm_list only if not yet connected.
> +	 */
> +	get_task_struct(tsk);
> +	mm->oom_mm.victim = tsk;
> +	if (!old_tsk) {
> +		atomic_inc(&mm->mm_count);
> +		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
> +	} else {
> +		put_task_struct(old_tsk);
> +	}

Isn't this overcomplicated? Why do we need to replace the old task by
the current one?

>  }
>  
>  /**
> @@ -1026,6 +1045,9 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  	}
>  
> +	if (!is_sysrq_oom(oc) && oom_has_pending_mm(oc->memcg, oc->nodemask))
> +		return true;
> +
>  	p = select_bad_process(oc, &points, totalpages);
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
>  	if (!p && !is_sysrq_oom(oc)) {
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-07 16:04 ` [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
@ 2016-07-11 13:16   ` Michal Hocko
  2016-07-12 13:38     ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-07-11 13:16 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Fri 08-07-16 01:04:46, Tetsuo Handa wrote:
> >From acc85fdd36452e39bace6aa73b3aaa41bbe776a5 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 8 Jul 2016 00:39:36 +0900
> Subject: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
> 
> Since OOM reaping is per mm_struct operation, it is natural to use
> list of mm_struct used by OOM victims. By using list of mm_struct,
> we can eliminate find_lock_task_mm() usage from the OOM reaper.

Again, the changelog doesn't state why getting rid of find_lock_task_mm
is useful.

You are also dropping exit_oom_victim from the oom reaper without any
explanation in the changelog.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

I haven't spotted anything obviously wrong with the patch. Conceptually
this sounds like a right approach. It is a bit sad we have to keep the
oom victim around as well but I guess this is acceptable.

That being said, I think that going mm queuing way is better than
keeping mm alive in signal_struct long term.

> ---
>  include/linux/oom.h |   8 -----
>  mm/memcontrol.c     |   1 -
>  mm/oom_kill.c       | 100 +++++++++++++++-------------------------------------
>  3 files changed, 29 insertions(+), 80 deletions(-)

The code reduction is really nice! Few notes below

[...]

>  #define MAX_OOM_REAP_RETRIES 10
> -static void oom_reap_task(struct task_struct *tsk)
> +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	int attempts = 0;
> -	struct mm_struct *mm = NULL;
> -	struct task_struct *p = find_lock_task_mm(tsk);
>  
>  	/*
> -	 * Make sure we find the associated mm_struct even when the particular
> -	 * thread has already terminated and cleared its mm.
> -	 * We might have race with exit path so consider our work done if there
> -	 * is no mm.
> +	 * Check MMF_OOM_REAPED in case oom_kill_process() found this mm
> +	 * pinned.
>  	 */
> -	if (!p)
> -		goto done;
> -	mm = p->mm;
> -	atomic_inc(&mm->mm_count);
> -	task_unlock(p);
> +	if (test_bit(MMF_OOM_REAPED, &mm->flags))
> +		return;
>  
>  	/* Retry the down_read_trylock(mmap_sem) a few times */
>  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
>  
>  	if (attempts <= MAX_OOM_REAP_RETRIES)
> -		goto done;
> +		return;
>  
>  	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
>  	set_bit(MMF_OOM_REAPED, &mm->flags);

This seems unnecessary when oom_reaper always calls exit_oom_mm. The
same applies to __oom_reap_task. Which then means that the flag is
turning into a misnomer. MMF_SKIP_OOM would fit better its current
meaning.

[...]

> @@ -600,41 +572,31 @@ static int oom_reaper(void *unused)
>  	set_freezable();
>  
>  	while (true) {
> -		struct task_struct *tsk = NULL;
> -
> -		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
> -		spin_lock(&oom_reaper_lock);
> -		if (oom_reaper_list != NULL) {
> -			tsk = oom_reaper_list;
> -			oom_reaper_list = tsk->oom_reaper_list;
> -		}
> -		spin_unlock(&oom_reaper_lock);
> -
> -		if (tsk)
> -			oom_reap_task(tsk);
> +		struct mm_struct *mm;
> +		struct task_struct *victim;
> +
> +		wait_event_freezable(oom_reaper_wait,
> +				     !list_empty(&oom_mm_list));
> +		mutex_lock(&oom_lock);
> +		mm = list_first_entry(&oom_mm_list, struct mm_struct,
> +				      oom_mm.list);
> +		victim = mm->oom_mm.victim;
> +		/*
> +		 * Take a reference on current victim thread in case
> +		 * oom_reap_task() raced with mark_oom_victim() by
> +		 * other threads sharing this mm.
> +		 */
> +		get_task_struct(victim);

If you didn't play old_task games in mark_oom_victim then you wouldn't
need this AFAIU.

> +		mutex_unlock(&oom_lock);
> +		oom_reap_task(victim, mm);
> +		put_task_struct(victim);
> +		/* Drop references taken by mark_oom_victim() */
> +		exit_oom_mm(mm);

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims.
  2016-07-07 16:06 ` [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims Tetsuo Handa
@ 2016-07-11 13:19   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-07-11 13:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Fri 08-07-16 01:06:12, Tetsuo Handa wrote:
> >From 2da51204250a2f1127792f98c12436771c07b67d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 8 Jul 2016 00:41:38 +0900
> Subject: [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims.
> 
> Since oom_has_pending_mm() controls whether to select next OOM victim,
> we no longer need to abort OOM victim selection loop using OOM_SCAN_ABORT
> case. Also, since signal_struct->oom_victims was used only for allowing
> oom_scan_process_thread() to return OOM_SCAN_ABORT, we no longer need to
> use signal_struct->oom_victims.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Yes this looks good!

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/oom.h   |  1 -
>  include/linux/sched.h |  1 -
>  mm/memcontrol.c       |  8 --------
>  mm/oom_kill.c         | 26 +-------------------------
>  4 files changed, 1 insertion(+), 35 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index cb3f041..27be4ba 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -49,7 +49,6 @@ enum oom_constraint {
>  enum oom_scan_t {
>  	OOM_SCAN_OK,		/* scan thread and find its badness */
>  	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
> -	OOM_SCAN_ABORT,		/* abort the iteration and return */
>  	OOM_SCAN_SELECT,	/* always select this thread first */
>  };
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c0efd80..2f57cf1c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -670,7 +670,6 @@ struct signal_struct {
>  	atomic_t		sigcnt;
>  	atomic_t		live;
>  	int			nr_threads;
> -	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
>  	struct list_head	thread_head;
>  
>  	wait_queue_head_t	wait_chldexit;	/* for wait4() */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5043324..6afe1c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1262,14 +1262,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  				/* fall through */
>  			case OOM_SCAN_CONTINUE:
>  				continue;
> -			case OOM_SCAN_ABORT:
> -				css_task_iter_end(&it);
> -				mem_cgroup_iter_break(memcg, iter);
> -				if (chosen)
> -					put_task_struct(chosen);
> -				/* Set a dummy value to return "true". */
> -				chosen = (void *) 1;
> -				goto unlock;
>  			case OOM_SCAN_OK:
>  				break;
>  			};
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 223e1fe..b6b79ae 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -304,25 +304,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  		return OOM_SCAN_CONTINUE;
>  
>  	/*
> -	 * This task already has access to memory reserves and is being killed.
> -	 * Don't allow any other task to have access to the reserves unless
> -	 * the task has MMF_OOM_REAPED because chances that it would release
> -	 * any memory is quite low.
> -	 */
> -	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
> -		struct task_struct *p = find_lock_task_mm(task);
> -		enum oom_scan_t ret = OOM_SCAN_ABORT;
> -
> -		if (p) {
> -			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
> -				ret = OOM_SCAN_CONTINUE;
> -			task_unlock(p);
> -		}
> -
> -		return ret;
> -	}
> -
> -	/*
>  	 * If task is allocating a lot of memory and has been marked to be
>  	 * killed first if it triggers an oom, then select it.
>  	 */
> @@ -354,9 +335,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
>  			/* fall through */
>  		case OOM_SCAN_CONTINUE:
>  			continue;
> -		case OOM_SCAN_ABORT:
> -			rcu_read_unlock();
> -			return (struct task_struct *)(-1UL);
>  		case OOM_SCAN_OK:
>  			break;
>  		};
> @@ -626,7 +604,6 @@ void mark_oom_victim(struct task_struct *tsk)
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> -	atomic_inc(&tsk->signal->oom_victims);
>  	/*
>  	 * Make sure that the task is woken up from uninterruptible sleep
>  	 * if it is frozen because OOM killer wouldn't be able to free
> @@ -659,7 +636,6 @@ void exit_oom_victim(struct task_struct *tsk)
>  {
>  	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> -	atomic_dec(&tsk->signal->oom_victims);
>  
>  	if (!atomic_dec_return(&oom_victims))
>  		wake_up_all(&oom_victims_wait);
> @@ -1012,7 +988,7 @@ bool out_of_memory(struct oom_control *oc)
>  		dump_header(oc, NULL);
>  		panic("Out of memory and no killable processes...\n");
>  	}
> -	if (p && p != (void *)-1UL) {
> +	if (p) {
>  		oom_kill_process(oc, p, points, totalpages, "Out of memory");
>  		/*
>  		 * Give the killed process a good chance to exit before trying
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread.
  2016-07-07 16:07 ` [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
@ 2016-07-11 13:22   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-07-11 13:22 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Fri 08-07-16 01:07:47, Tetsuo Handa wrote:
> >From fefcbf2412e38b006281cf1d20f9e3a2bb76714f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 8 Jul 2016 00:42:48 +0900
> Subject: [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread.
> 
> Since oom_has_pending_mm() controls whether to select next OOM victim,
> we are no longer clearing TIF_MEMDIE on remote thread.

The changelog is quite cryptic. What about:
"
Since no kernel code path needs to clear TIF_MEMDIE flag on a remote
thread we can drop the task parameter and enforce that actually.
"
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

with the updated changelog
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/oom.h | 2 +-
>  kernel/exit.c       | 2 +-
>  mm/oom_kill.c       | 5 ++---
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 27be4ba..90e98ea 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -90,7 +90,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  
>  extern bool out_of_memory(struct oom_control *oc);
>  
> -extern void exit_oom_victim(struct task_struct *tsk);
> +extern void exit_oom_victim(void);
>  
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 84ae830..1b1dada 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk)
>  	mm_update_next_owner(mm);
>  	mmput(mm);
>  	if (test_thread_flag(TIF_MEMDIE))
> -		exit_oom_victim(tsk);
> +		exit_oom_victim();
>  }
>  
>  static struct task_struct *find_alive_thread(struct task_struct *p)
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b6b79ae..d120cb1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -632,10 +632,9 @@ void mark_oom_victim(struct task_struct *tsk)
>  /**
>   * exit_oom_victim - note the exit of an OOM victim
>   */
> -void exit_oom_victim(struct task_struct *tsk)
> +void exit_oom_victim(void)
>  {
> -	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> -		return;
> +	clear_thread_flag(TIF_MEMDIE);
>  
>  	if (!atomic_dec_return(&oom_victims))
>  		wake_up_all(&oom_victims_wait);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-11 12:50   ` Michal Hocko
@ 2016-07-12  6:00     ` Tetsuo Handa
  2016-07-12  7:09       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-12  6:00 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

Michal Hocko wrote:
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 7926993..8e469e0 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm)
> >  	}
> >  	if (mm->binfmt)
> >  		module_put(mm->binfmt->module);
> > +#ifndef CONFIG_MMU
> > +	if (mm->oom_mm.victim)
> > +		exit_oom_mm(mm);
> > +#endif
> 
> This ifdef is not really needed. There is no reason we should wait for
> the oom_reaper to unlink the mm.

Oleg wanted to avoid adding OOM related hooks if possible
( http://lkml.kernel.org/r/20160705205231.GA25340@redhat.com ),
but I thought that calling exit_oom_mm() from here is better for CONFIG_MMU=n case
( http://lkml.kernel.org/r/201607062043.FEC86485.JFFVLtFOQOSHMO@I-love.SAKURA.ne.jp ).

I think that not calling exit_oom_mm() from here is better for CONFIG_MMU=y case.
Calling exit_oom_mm() from here will require !list_empty() check after holding
oom_lock at oom_reaper(). Instead, we can do

+#ifdef CONFIG_MMU
+	if (mm->oom_mm.victim)
+		set_bit(MMF_OOM_REAPED, &mm->flags);
+#else
+	if (mm->oom_mm.victim)
+		exit_oom_mm(mm);
+#endif

here and let oom_has_pending_mm() check for MMF_OOM_REAPED.

> > +bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
> > +{
> > +	struct mm_struct *mm;
> > +
> > +	list_for_each_entry(mm, &oom_mm_list, oom_mm.list)
> > +		if (!oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
> > +			return true;
> 
> The condition is quite hard to read. Moreover 2 of 4 conditions are
> never true. Wouldn't it be better to do something like the following?
> 

No problem.

> > @@ -653,6 +657,9 @@ subsys_initcall(oom_init)
> >   */
> >  void mark_oom_victim(struct task_struct *tsk)
> >  {
> > +	struct mm_struct *mm = tsk->mm;
> > +	struct task_struct *old_tsk = mm->oom_mm.victim;
> > +
> >  	WARN_ON(oom_killer_disabled);
> >  	/* OOM killer might race with memcg OOM */
> >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > @@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk)
> >  	 */
> >  	__thaw_task(tsk);
> >  	atomic_inc(&oom_victims);
> > +	/*
> > +	 * Since mark_oom_victim() is called from multiple threads,
> > +	 * connect this mm to oom_mm_list only if not yet connected.
> > +	 */
> > +	get_task_struct(tsk);
> > +	mm->oom_mm.victim = tsk;
> > +	if (!old_tsk) {
> > +		atomic_inc(&mm->mm_count);
> > +		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
> > +	} else {
> > +		put_task_struct(old_tsk);
> > +	}
> 
> Isn't this overcomplicated? Why do we need to replace the old task by
> the current one?

I'm not sure whether task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in
oom_has_pending_mm() will work as expected, especially when all threads in
one thread group (which mm->oom_mm.victim belongs to) reached TASK_DEAD state.
( http://lkml.kernel.org/r/201607042150.CIB00512.FSOtMHLOOVFFQJ@I-love.SAKURA.ne.jp )

I guess that task_in_oom_domain() will return false, and that mm will be selected
by another thread group (which mm->oom_mm.victim does not belongs to). Therefore,
I think we need to replace the old task with the new task (at least when
task_in_oom_domain() returned false) at mark_oom_victim().

If task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in oom_has_pending_mm()
does not work as expected even if we replace the old task with the new task at
mark_oom_victim(), I think we after all need to use something like

struct task_struct {
(...snipped...)
+	struct mm_struct *oom_mm; /* current->mm as of getting TIF_MEMDIE */
+	struct task_struct *oom_mm_list; /* Connected to oom_mm_list global list. */
-#ifdef CONFIG_MMU
-	struct task_struct *oom_reaper_list;
-#endif
(...snipped...)
};

or your signal_struct->oom_mm approach.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-12  6:00     ` Tetsuo Handa
@ 2016-07-12  7:09       ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-07-12  7:09 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Tue 12-07-16 15:00:41, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 7926993..8e469e0 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -722,6 +722,10 @@ static inline void __mmput(struct mm_struct *mm)
> > >  	}
> > >  	if (mm->binfmt)
> > >  		module_put(mm->binfmt->module);
> > > +#ifndef CONFIG_MMU
> > > +	if (mm->oom_mm.victim)
> > > +		exit_oom_mm(mm);
> > > +#endif
> > 
> > This ifdef is not really needed. There is no reason we should wait for
> > the oom_reaper to unlink the mm.
> 
> Oleg wanted to avoid adding OOM related hooks if possible
> ( http://lkml.kernel.org/r/20160705205231.GA25340@redhat.com ),
> but I thought that calling exit_oom_mm() from here is better for CONFIG_MMU=n case
> ( http://lkml.kernel.org/r/201607062043.FEC86485.JFFVLtFOQOSHMO@I-love.SAKURA.ne.jp ).

__mmput is a MM part of the exit path so I think sticking oom related
things there is not harmful. Yes we do take a global lock (btw. the lock
contention could be reduced if you preserve the existing spinlock and
use it for enqueing. Besides that the ifdef is really ugly.

> I think that not calling exit_oom_mm() from here is better for CONFIG_MMU=y case.
> Calling exit_oom_mm() from here will require !list_empty() check after holding
> oom_lock at oom_reaper(). Instead, we can do
> 
> +#ifdef CONFIG_MMU
> +	if (mm->oom_mm.victim)
> +		set_bit(MMF_OOM_REAPED, &mm->flags);
> +#else
> +	if (mm->oom_mm.victim)
> +		exit_oom_mm(mm);
> +#endif
> 
> here and let oom_has_pending_mm() check for MMF_OOM_REAPED.

Yes that would be possible but why should we make this more complicated
than necessary. It is natural that exit_oom_mm is called after the
address space has been torn down. This would be the most common path.
We have oom_reaper as a backup if this doesn't happen in time. It really
doesn't make much sense to keep mm on the list artificially if the
oom_reaper should just skip over it because it is already empty.
So please let's make it as simple as possible.

[...]
> > > @@ -653,6 +657,9 @@ subsys_initcall(oom_init)
> > >   */
> > >  void mark_oom_victim(struct task_struct *tsk)
> > >  {
> > > +	struct mm_struct *mm = tsk->mm;
> > > +	struct task_struct *old_tsk = mm->oom_mm.victim;
> > > +
> > >  	WARN_ON(oom_killer_disabled);
> > >  	/* OOM killer might race with memcg OOM */
> > >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > @@ -666,6 +673,18 @@ void mark_oom_victim(struct task_struct *tsk)
> > >  	 */
> > >  	__thaw_task(tsk);
> > >  	atomic_inc(&oom_victims);
> > > +	/*
> > > +	 * Since mark_oom_victim() is called from multiple threads,
> > > +	 * connect this mm to oom_mm_list only if not yet connected.
> > > +	 */
> > > +	get_task_struct(tsk);
> > > +	mm->oom_mm.victim = tsk;
> > > +	if (!old_tsk) {
> > > +		atomic_inc(&mm->mm_count);
> > > +		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
> > > +	} else {
> > > +		put_task_struct(old_tsk);
> > > +	}
> > 
> > Isn't this overcomplicated? Why do we need to replace the old task by
> > the current one?
> 
> I'm not sure whether task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in
> oom_has_pending_mm() will work as expected, especially when all threads in
> one thread group (which mm->oom_mm.victim belongs to) reached TASK_DEAD state.
> ( http://lkml.kernel.org/r/201607042150.CIB00512.FSOtMHLOOVFFQJ@I-love.SAKURA.ne.jp )
> 
> I guess that task_in_oom_domain() will return false, and that mm will be selected
> by another thread group (which mm->oom_mm.victim does not belongs to). Therefore,
> I think we need to replace the old task with the new task (at least when
> task_in_oom_domain() returned false) at mark_oom_victim().

Can we do that in a separate patch then. It would make this patch easier
to review and we can discuss about this corner case without distracting
from the main point of this patch series.

> If task_in_oom_domain(mm->oom_mm.victim, memcg, nodemask) in oom_has_pending_mm()
> does not work as expected even if we replace the old task with the new task at
> mark_oom_victim(), I think we after all need to use something like
> 
> struct task_struct {
> (...snipped...)
> +	struct mm_struct *oom_mm; /* current->mm as of getting TIF_MEMDIE */
> +	struct task_struct *oom_mm_list; /* Connected to oom_mm_list global list. */
> -#ifdef CONFIG_MMU
> -	struct task_struct *oom_reaper_list;
> -#endif
> (...snipped...)
> };
> 
> or your signal_struct->oom_mm approach.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-11 13:16   ` Michal Hocko
@ 2016-07-12 13:38     ` Tetsuo Handa
  2016-07-12 13:46       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-12 13:38 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

Michal Hocko wrote:
> >  #define MAX_OOM_REAP_RETRIES 10
> > -static void oom_reap_task(struct task_struct *tsk)
> > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
> >  {
> >  	int attempts = 0;
> > -	struct mm_struct *mm = NULL;
> > -	struct task_struct *p = find_lock_task_mm(tsk);
> >  
> >  	/*
> > -	 * Make sure we find the associated mm_struct even when the particular
> > -	 * thread has already terminated and cleared its mm.
> > -	 * We might have race with exit path so consider our work done if there
> > -	 * is no mm.
> > +	 * Check MMF_OOM_REAPED in case oom_kill_process() found this mm
> > +	 * pinned.
> >  	 */
> > -	if (!p)
> > -		goto done;
> > -	mm = p->mm;
> > -	atomic_inc(&mm->mm_count);
> > -	task_unlock(p);
> > +	if (test_bit(MMF_OOM_REAPED, &mm->flags))
> > +		return;
> >  
> >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> >  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
> >  		schedule_timeout_idle(HZ/10);
> >  
> >  	if (attempts <= MAX_OOM_REAP_RETRIES)
> > -		goto done;
> > +		return;
> >  
> >  	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
> >  	set_bit(MMF_OOM_REAPED, &mm->flags);
> 
> This seems unnecessary when oom_reaper always calls exit_oom_mm. The
> same applies to __oom_reap_task. Which then means that the flag is
> turning into a misnomer. MMF_SKIP_OOM would fit better its current
> meaning.

Large oom_score_adj value or being a child process of highest OOM score
might cause the same mm being selected again. I think these set_bit() are
necessary in order to avoid the same mm being selected again.

Other than that, I sent v3 patchset.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-12 13:38     ` Tetsuo Handa
@ 2016-07-12 13:46       ` Michal Hocko
  2016-07-12 13:55         ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-07-12 13:46 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Tue 12-07-16 22:38:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > >  #define MAX_OOM_REAP_RETRIES 10
> > > -static void oom_reap_task(struct task_struct *tsk)
> > > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
> > >  {
> > >  	int attempts = 0;
> > > -	struct mm_struct *mm = NULL;
> > > -	struct task_struct *p = find_lock_task_mm(tsk);
> > >  
> > >  	/*
> > > -	 * Make sure we find the associated mm_struct even when the particular
> > > -	 * thread has already terminated and cleared its mm.
> > > -	 * We might have race with exit path so consider our work done if there
> > > -	 * is no mm.
> > > +	 * Check MMF_OOM_REAPED in case oom_kill_process() found this mm
> > > +	 * pinned.
> > >  	 */
> > > -	if (!p)
> > > -		goto done;
> > > -	mm = p->mm;
> > > -	atomic_inc(&mm->mm_count);
> > > -	task_unlock(p);
> > > +	if (test_bit(MMF_OOM_REAPED, &mm->flags))
> > > +		return;
> > >  
> > >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > >  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
> > >  		schedule_timeout_idle(HZ/10);
> > >  
> > >  	if (attempts <= MAX_OOM_REAP_RETRIES)
> > > -		goto done;
> > > +		return;
> > >  
> > >  	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
> > >  	set_bit(MMF_OOM_REAPED, &mm->flags);
> > 
> > This seems unnecessary when oom_reaper always calls exit_oom_mm. The
> > same applies to __oom_reap_task. Which then means that the flag is
> > turning into a misnomer. MMF_SKIP_OOM would fit better its current
> > meaning.
> 
> Large oom_score_adj value or being a child process of highest OOM score
> might cause the same mm being selected again. I think these set_bit() are
> necessary in order to avoid the same mm being selected again.

I do not understand. Child will have a different mm struct from the
parent and I do not see how oom_score_adj is relevant here. Could you
elaborate, please?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-12 13:46       ` Michal Hocko
@ 2016-07-12 13:55         ` Michal Hocko
  2016-07-12 14:01           ` Tetsuo Handa
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-07-12 13:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Tue 12-07-16 15:46:57, Michal Hocko wrote:
> On Tue 12-07-16 22:38:42, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > >  #define MAX_OOM_REAP_RETRIES 10
> > > > -static void oom_reap_task(struct task_struct *tsk)
> > > > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
> > > >  {
> > > >  	int attempts = 0;
> > > > -	struct mm_struct *mm = NULL;
> > > > -	struct task_struct *p = find_lock_task_mm(tsk);
> > > >  
> > > >  	/*
> > > > -	 * Make sure we find the associated mm_struct even when the particular
> > > > -	 * thread has already terminated and cleared its mm.
> > > > -	 * We might have race with exit path so consider our work done if there
> > > > -	 * is no mm.
> > > > +	 * Check MMF_OOM_REAPED in case oom_kill_process() found this mm
> > > > +	 * pinned.
> > > >  	 */
> > > > -	if (!p)
> > > > -		goto done;
> > > > -	mm = p->mm;
> > > > -	atomic_inc(&mm->mm_count);
> > > > -	task_unlock(p);
> > > > +	if (test_bit(MMF_OOM_REAPED, &mm->flags))
> > > > +		return;
> > > >  
> > > >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > > >  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
> > > >  		schedule_timeout_idle(HZ/10);
> > > >  
> > > >  	if (attempts <= MAX_OOM_REAP_RETRIES)
> > > > -		goto done;
> > > > +		return;
> > > >  
> > > >  	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
> > > >  	set_bit(MMF_OOM_REAPED, &mm->flags);
> > > 
> > > This seems unnecessary when oom_reaper always calls exit_oom_mm. The
> > > same applies to __oom_reap_task. Which then means that the flag is
> > > turning into a misnomer. MMF_SKIP_OOM would fit better its current
> > > meaning.
> > 
> > Large oom_score_adj value or being a child process of highest OOM score
> > might cause the same mm being selected again. I think these set_bit() are
> > necessary in order to avoid the same mm being selected again.
> 
> I do not understand. Child will have a different mm struct from the
> parent and I do not see how oom_score_adj is relevant here. Could you
> elaborate, please?

OK, I guess I got your point. You mean we can select the same child/task
again after it has passed its exit_oom_mm. Trying to oom_reap such a
task would be obviously pointless. Then it would be better to stich that
set_bit into exit_oom_mm. Renaming it would be also better in that
context.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-12 13:55         ` Michal Hocko
@ 2016-07-12 14:01           ` Tetsuo Handa
  0 siblings, 0 replies; 19+ messages in thread
From: Tetsuo Handa @ 2016-07-12 14:01 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

Michal Hocko wrote:
> On Tue 12-07-16 15:46:57, Michal Hocko wrote:
> > On Tue 12-07-16 22:38:42, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > >  #define MAX_OOM_REAP_RETRIES 10
> > > > > -static void oom_reap_task(struct task_struct *tsk)
> > > > > +static void oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
> > > > >  {
> > > > >  	int attempts = 0;
> > > > > -	struct mm_struct *mm = NULL;
> > > > > -	struct task_struct *p = find_lock_task_mm(tsk);
> > > > >  
> > > > >  	/*
> > > > > -	 * Make sure we find the associated mm_struct even when the particular
> > > > > -	 * thread has already terminated and cleared its mm.
> > > > > -	 * We might have race with exit path so consider our work done if there
> > > > > -	 * is no mm.
> > > > > +	 * Check MMF_OOM_REAPED in case oom_kill_process() found this mm
> > > > > +	 * pinned.
> > > > >  	 */
> > > > > -	if (!p)
> > > > > -		goto done;
> > > > > -	mm = p->mm;
> > > > > -	atomic_inc(&mm->mm_count);
> > > > > -	task_unlock(p);
> > > > > +	if (test_bit(MMF_OOM_REAPED, &mm->flags))
> > > > > +		return;
> > > > >  
> > > > >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > > > >  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
> > > > >  		schedule_timeout_idle(HZ/10);
> > > > >  
> > > > >  	if (attempts <= MAX_OOM_REAP_RETRIES)
> > > > > -		goto done;
> > > > > +		return;
> > > > >  
> > > > >  	/* Ignore this mm because somebody can't call up_write(mmap_sem). */
> > > > >  	set_bit(MMF_OOM_REAPED, &mm->flags);
> > > > 
> > > > This seems unnecessary when oom_reaper always calls exit_oom_mm. The
> > > > same applies to __oom_reap_task. Which then means that the flag is
> > > > turning into a misnomer. MMF_SKIP_OOM would fit better its current
> > > > meaning.
> > > 
> > > Large oom_score_adj value or being a child process of highest OOM score
> > > might cause the same mm being selected again. I think these set_bit() are
> > > necessary in order to avoid the same mm being selected again.
> > 
> > I do not understand. Child will have a different mm struct from the
> > parent and I do not see how oom_score_adj is relevant here. Could you
> > elaborate, please?
> 
> OK, I guess I got your point. You mean we can select the same child/task
> again after it has passed its exit_oom_mm. Trying to oom_reap such a
> task would be obviously pointless. Then it would be better to stich that
> set_bit into exit_oom_mm. Renaming it would be also better in that
> context.
> 

Right.

oom_kill_process() receives a task which oom_badness() returned highest
score. But list_for_each_entry(child, &t->children, sibling) selects a
child process if that task has any OOM killable child process. The OOM
killer will kill that child process and the OOM reaper reaps memory from
that child process. But if the OOM reaper does not set MMF_OOM_REAPED after
reaping that child's memory, next round of the OOM killer will select that
child process again. It is possible that the child process was consuming
too little memory to solve the OOM situation.

Even if that task does not have any OOM killable child process, it is
possible that that task was consuming too little memory to solve the OOM
situation. We can misguide the OOM killer to select such process by
setting oom_score_adj to 1000.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-07-12 14:01 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-07 15:58 [PATCH v2 0/6] Change OOM killer to use list of mm_struct Tetsuo Handa
2016-07-07 16:00 ` [PATCH 1/6] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
2016-07-11 12:02   ` Michal Hocko
2016-07-07 16:01 ` [PATCH 2/6] mm,oom_reaper: Do not attempt to reap a task twice Tetsuo Handa
2016-07-11 12:15   ` Michal Hocko
2016-07-07 16:03 ` [PATCH 3/6] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
2016-07-11 12:50   ` Michal Hocko
2016-07-12  6:00     ` Tetsuo Handa
2016-07-12  7:09       ` Michal Hocko
2016-07-07 16:04 ` [PATCH 4/6] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
2016-07-11 13:16   ` Michal Hocko
2016-07-12 13:38     ` Tetsuo Handa
2016-07-12 13:46       ` Michal Hocko
2016-07-12 13:55         ` Michal Hocko
2016-07-12 14:01           ` Tetsuo Handa
2016-07-07 16:06 ` [PATCH 5/6] mm,oom: Remove OOM_SCAN_ABORT case and signal_struct->oom_victims Tetsuo Handa
2016-07-11 13:19   ` Michal Hocko
2016-07-07 16:07 ` [PATCH 6/6] mm,oom: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
2016-07-11 13:22   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).