* [PATCH 0/8] Change OOM killer to use list of mm_struct.
@ 2016-07-03  2:35 Tetsuo Handa
  2016-07-03  2:36 ` [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check Tetsuo Handa
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:35 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

This is my alternative proposal to what Michal posted at
http://lkml.kernel.org/r/1467365190-24640-1-git-send-email-mhocko@kernel.org .

The series is based on top of linux-next-20160701 +
http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org .

The key point of the series is [PATCH 3/8].
[PATCH 1/8] can be sent to current linux.git as a clean up.
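
For readers skimming the cover letter, the core of [PATCH 3/8] is a small
tracking structure embedded into mm_struct plus a global list of mm_struct
used by OOM victims. A rough sketch of the shape (field names taken from
the patch itself, slightly abridged):

struct oom_mm {
	struct list_head list;       /* linked into the global oom_mm_list */
	struct mem_cgroup *memcg;    /* for oom_has_pending_mm() filtering, no deref */
	const nodemask_t *nodemask;  /* for oom_has_pending_mm() filtering, no deref */
};

/*
 * struct mm_struct gains a "struct oom_mm oom_mm;" member, and
 * out_of_memory() consults oom_has_pending_mm() instead of scanning
 * the whole process list for TIF_MEMDIE threads.
 */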

The series does not include patches for use_mm() users and wait_event()
in oom_killer_disable(). Thus, some of Michal's patches can be added
on top of the series.

[PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
[PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage.
[PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
[PATCH 4/8] mm,oom: Remove OOM_SCAN_ABORT case.
[PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims.
[PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
[PATCH 7/8] mm,oom_reaper: Pass OOM victim's comm and pid values via mm_struct.
[PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct.

 include/linux/mm_types.h |   12 +
 include/linux/oom.h      |   16 --
 include/linux/sched.h    |    4
 kernel/exit.c            |    2
 kernel/fork.c            |    1
 kernel/power/process.c   |   12 -
 mm/memcontrol.c          |   16 --
 mm/oom_kill.c            |  336 +++++++++++++++++++++--------------------------
 8 files changed, 177 insertions(+), 222 deletions(-)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
@ 2016-07-03  2:36 ` Tetsuo Handa
  2016-07-03 12:42   ` Oleg Nesterov
  2016-07-03  2:37 ` [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:36 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From c8b596f22c4fcc9424149681c568e1ab90e545a1 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 22:53:03 +0900
Subject: [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.

Since oom_init() is called before OOM-killable userspace processes are
started, the system will panic if out_of_memory() is called before
oom_init() returns. Therefore, the oom_reaper_th == NULL check in
wake_oom_reaper() is pointless.

If kthread_run() in oom_init() fails for reasons other than OOM
(e.g. no free pid is available), userspace processes won't be able to
start either. Therefore, trying to continue with an error message is
also pointless.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275..16340f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -447,7 +447,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
  */
-static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
@@ -629,9 +628,6 @@ static int oom_reaper(void *unused)
 
 void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
-		return;
-
 	/* tsk is already queued? */
 	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
 		return;
@@ -647,12 +643,7 @@ void wake_oom_reaper(struct task_struct *tsk)
 
 static int __init oom_init(void)
 {
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
-	if (IS_ERR(oom_reaper_th)) {
-		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
-				PTR_ERR(oom_reaper_th));
-		oom_reaper_th = NULL;
-	}
+	kthread_run(oom_reaper, NULL, "oom_reaper");
 	return 0;
 }
 subsys_initcall(oom_init)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
  2016-07-03  2:36 ` [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check Tetsuo Handa
@ 2016-07-03  2:37 ` Tetsuo Handa
  2016-07-07 11:19   ` Michal Hocko
  2016-07-03  2:38 ` [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:37 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From 3be379c6b42a0901cd81fb2c743e321b6fbdec5b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 22:55:17 +0900
Subject: [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage.

Since holding an mm_struct with an elevated mm_count for a second is
harmless, we can look up the mm_struct and pin it upon entry to
oom_reap_task(). This patch has no functional change. A future patch in
this series will eliminate find_lock_task_mm() usage from the OOM reaper.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 79 ++++++++++++++++++++++++++++-------------------------------
 1 file changed, 37 insertions(+), 42 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 16340f2..76c765e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -451,12 +451,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
@@ -477,22 +475,9 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	mutex_lock(&oom_lock);
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	/*
@@ -502,7 +487,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -550,8 +535,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
@@ -561,36 +544,45 @@ unlock_oom:
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
+	struct mm_struct *mm = NULL;
+	struct task_struct *p = find_lock_task_mm(tsk);
+
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might have race with exit path so consider our work done if there
+	 * is no mm.
+	 */
+	if (!p)
+		goto done;
+	mm = p->mm;
+	atomic_inc(&mm->mm_count);
+	task_unlock(p);
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
+	if (attempts <= MAX_OOM_REAP_RETRIES)
+		goto done;
 
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
+	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+		task_pid_nr(tsk), tsk->comm);
 
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
-		}
-
-		debug_show_all_locks();
+	/*
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
+	 */
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		set_bit(MMF_OOM_REAPED, &mm->flags);
 	}
+	debug_show_all_locks();
 
+done:
 	/*
 	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
@@ -602,6 +594,9 @@ static void oom_reap_task(struct task_struct *tsk)
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
+	/* Drop a reference taken above. */
+	if (mm)
+		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
  2016-07-03  2:36 ` [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check Tetsuo Handa
  2016-07-03  2:37 ` [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
@ 2016-07-03  2:38 ` Tetsuo Handa
  2016-07-04 10:39   ` Oleg Nesterov
  2016-07-07 11:31   ` Michal Hocko
  2016-07-03  2:39 ` [PATCH 4/8] mm,oom: Remove OOM_SCAN_ABORT case Tetsuo Handa
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:38 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From d86a4340be775bd00217e596ed7f2d65f45b58cb Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 22:56:21 +0900
Subject: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.

Currently, we walk the process list in order to find existing TIF_MEMDIE
threads. But if we remember the list of mm_struct used by TIF_MEMDIE
threads, we can avoid walking the process list. A future patch in this
series allows the OOM reaper to use the list of mm_struct introduced by
this patch.

This patch reverts commit e2fe14564d3316d1 ("oom_reaper: close race with
exiting task") because oom_has_pending_mm() will prevent that race.

Because oom_has_pending_mm() remains true until that mm is removed from
the list, this patch temporarily breaks what commit 36324a990cf578b5
("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address
space") and commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE
for all tasks queued for oom_reaper") tried to address. A future patch
in this series will fix it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/mm_types.h |   8 ++++
 include/linux/oom.h      |   5 ++-
 kernel/fork.c            |   1 +
 mm/memcontrol.c          |   7 +++-
 mm/oom_kill.c            | 101 +++++++++++++++++++++++++++++++++--------------
 5 files changed, 91 insertions(+), 31 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d..718c0bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/nodemask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -392,6 +393,12 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+struct oom_mm {
+	struct list_head list; /* Linked to oom_mm_list list. */
+	struct mem_cgroup *memcg; /* No deref. Maybe NULL. */
+	const nodemask_t *nodemask; /* No deref. Maybe NULL. */
+};
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
@@ -515,6 +522,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+	struct oom_mm oom_mm;
 #ifdef CONFIG_MMU
 	struct work_struct async_put_work;
 #endif
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457..1e538c5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,7 +70,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
+extern void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc);
 
 #ifdef CONFIG_MMU
 extern void wake_oom_reaper(struct task_struct *tsk);
@@ -91,6 +91,9 @@ extern void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 extern void check_panic_on_oom(struct oom_control *oc,
 			       enum oom_constraint constraint);
 
+extern void exit_oom_mm(struct mm_struct *mm);
+extern bool oom_has_pending_mm(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask);
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 					       struct task_struct *task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 7926993..b870dbc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -722,6 +722,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	exit_oom_mm(mm);
 	mmdrop(mm);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40dfca3..835c95c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1235,12 +1235,17 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
+		mark_oom_victim(current, &oc);
 		wake_oom_reaper(current);
 		goto unlock;
 	}
 
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG);
+	if (oom_has_pending_mm(memcg, NULL)) {
+		/* Set a dummy value to return "true". */
+		chosen = (void *) 1;
+		goto unlock;
+	}
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 76c765e..39c5034 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -275,6 +275,48 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+static LIST_HEAD(oom_mm_list);
+static DEFINE_SPINLOCK(oom_mm_lock);
+
+void exit_oom_mm(struct mm_struct *mm)
+{
+	bool remove;
+
+	/*
+	 * Since exit_oom_mm() can be concurrently called by exiting thread
+	 * and the OOM reaper thread, disconnect this mm from oom_mm_list
+	 * only if still connected.
+	 */
+	spin_lock(&oom_mm_lock);
+	remove = mm->oom_mm.list.next;
+	if (remove) {
+		list_del(&mm->oom_mm.list);
+		mm->oom_mm.list.next = NULL;
+	}
+	spin_unlock(&oom_mm_lock);
+	/* Drop a reference taken by mark_oom_victim() */
+	if (remove)
+		mmdrop(mm);
+}
+
+bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
+{
+	struct oom_mm *mm;
+	bool ret = false;
+
+	spin_lock(&oom_mm_lock);
+	list_for_each_entry(mm, &oom_mm_list, list) {
+		if (memcg && mm->memcg != memcg)
+			continue;
+		if (nodemask && mm->nodemask != nodemask)
+			continue;
+		ret = true;
+		break;
+	}
+	spin_unlock(&oom_mm_lock);
+	return ret;
+}
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 					struct task_struct *task)
 {
@@ -457,28 +499,9 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto unlock_oom;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * increase mm_users only after we know we will reap something so
@@ -487,7 +510,7 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto unlock_oom;
+		return true;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -535,9 +558,7 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -647,18 +668,37 @@ subsys_initcall(oom_init)
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
+ * @oc: oom_control
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
  */
-void mark_oom_victim(struct task_struct *tsk)
+void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 {
+	struct mm_struct *mm = tsk->mm;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 	atomic_inc(&tsk->signal->oom_victims);
 	/*
+	 * Since mark_oom_victim() is called from multiple threads,
+	 * connect this mm to oom_mm_list only if not yet connected.
+	 *
+	 * Since mark_oom_victim() is called with a stable mm (i.e.
+	 * mm->mm_users > 0), exit_oom_mm() from __mmput() can't be called
+	 * before we add this mm to the list.
+	 */
+	spin_lock(&oom_mm_lock);
+	if (!mm->oom_mm.list.next) {
+		atomic_inc(&mm->mm_count);
+		mm->oom_mm.memcg = oc->memcg;
+		mm->oom_mm.nodemask = oc->nodemask;
+		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
+	}
+	spin_unlock(&oom_mm_lock);
+	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
 	 * any memory and livelock. freezing_slow_path will tell the freezer
@@ -815,7 +855,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
+		mark_oom_victim(p, oc);
 		wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
@@ -876,7 +916,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
+	mark_oom_victim(victim, oc);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -994,7 +1034,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
+		mark_oom_victim(current, oc);
 		wake_oom_reaper(current);
 		return true;
 	}
@@ -1026,6 +1066,9 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (!is_sysrq_oom(oc) && oom_has_pending_mm(oc->memcg, oc->nodemask))
+		return true;
+
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/8] mm,oom: Remove OOM_SCAN_ABORT case.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (2 preceding siblings ...)
  2016-07-03  2:38 ` [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
@ 2016-07-03  2:39 ` Tetsuo Handa
  2016-07-03  2:40 ` [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims Tetsuo Handa
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:39 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From e2d850ccbc7abc2643f5aac4dd09787f09270fd4 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 23:01:17 +0900
Subject: [PATCH 4/8] mm,oom: Remove OOM_SCAN_ABORT case.

Since oom_has_pending_mm() controls whether to select the next OOM
victim, we no longer need to abort the OOM victim selection loop via the
OOM_SCAN_ABORT case. Remove it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |  1 -
 mm/memcontrol.c     |  8 --------
 mm/oom_kill.c       | 24 +-----------------------
 3 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1e538c5..5991d41 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,7 +49,6 @@ enum oom_constraint {
 enum oom_scan_t {
 	OOM_SCAN_OK,		/* scan thread and find its badness */
 	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
-	OOM_SCAN_ABORT,		/* abort the iteration and return */
 	OOM_SCAN_SELECT,	/* always select this thread first */
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 835c95c..c17160d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1263,14 +1263,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				/* fall through */
 			case OOM_SCAN_CONTINUE:
 				continue;
-			case OOM_SCAN_ABORT:
-				css_task_iter_end(&it);
-				mem_cgroup_iter_break(memcg, iter);
-				if (chosen)
-					put_task_struct(chosen);
-				/* Set a dummy value to return "true". */
-				chosen = (void *) 1;
-				goto unlock;
 			case OOM_SCAN_OK:
 				break;
 			};
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 39c5034..734378a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -324,25 +324,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		return OOM_SCAN_CONTINUE;
 
 	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_REAPED because chances that it would release
-	 * any memory is quite low.
-	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
-		enum oom_scan_t ret = OOM_SCAN_ABORT;
-
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
-
-		return ret;
-	}
-
-	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
@@ -374,9 +355,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 			/* fall through */
 		case OOM_SCAN_CONTINUE:
 			continue;
-		case OOM_SCAN_ABORT:
-			rcu_read_unlock();
-			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
 			break;
 		};
@@ -1075,7 +1053,7 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (p && p != (void *)-1UL) {
+	if (p) {
 		oom_kill_process(oc, p, points, totalpages, "Out of memory");
 		/*
 		 * Give the killed process a good chance to exit before trying
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (3 preceding siblings ...)
  2016-07-03  2:39 ` [PATCH 4/8] mm,oom: Remove OOM_SCAN_ABORT case Tetsuo Handa
@ 2016-07-03  2:40 ` Tetsuo Handa
  2016-07-07 14:03   ` Michal Hocko
  2016-07-03  2:40 ` [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:40 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From e2e22cd0c0c486d1aef01232fcc62b00fb01709f Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 23:02:19 +0900
Subject: [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims.

Since the OOM_SCAN_ABORT case was removed, signal_struct->oom_victims is
no longer used. Remove it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h | 1 -
 mm/oom_kill.c         | 2 --
 2 files changed, 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..f472f27 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -671,7 +671,6 @@ struct signal_struct {
 	atomic_t		sigcnt;
 	atomic_t		live;
 	int			nr_threads;
-	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
 	struct list_head	thread_head;
 
 	wait_queue_head_t	wait_chldexit;	/* for wait4() */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 734378a..55e0ffb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -659,7 +659,6 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_inc(&tsk->signal->oom_victims);
 	/*
 	 * Since mark_oom_victim() is called from multiple threads,
 	 * connect this mm to oom_mm_list only if not yet connected.
@@ -693,7 +692,6 @@ void exit_oom_victim(struct task_struct *tsk)
 {
 	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_dec(&tsk->signal->oom_victims);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (4 preceding siblings ...)
  2016-07-03  2:40 ` [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims Tetsuo Handa
@ 2016-07-03  2:40 ` Tetsuo Handa
  2016-07-07 14:06   ` Michal Hocko
  2016-07-03  2:41 ` [PATCH 7/8] mm,oom_reaper: Pass OOM victim's comm and pid values via mm_struct Tetsuo Handa
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:40 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From 00b7a14653c9700429f89e4512f6000a39cce59d Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 23:03:03 +0900
Subject: [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.

Since oom_has_pending_mm() controls whether to select the next OOM
victim, we no longer need to clear TIF_MEMDIE on a remote thread.
Therefore, revert the related changes in commit 36324a990cf578b5 ("oom:
clear TIF_MEMDIE after oom_reaper managed to unmap the address space"),
commit e26796066fdf929c ("oom: make oom_reaper freezable") and
commit 74070542099c66d8 ("oom, suspend: fix oom_reaper vs.
oom_killer_disable race").

This patch temporarily breaks what commit 36324a990cf578b5 ("oom: clear
TIF_MEMDIE after oom_reaper managed to unmap the address space") and
commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
queued for oom_reaper") tried to solve, because of oom_has_pending_mm().
A future patch in this series will fix it.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h    |  2 +-
 kernel/exit.c          |  2 +-
 kernel/power/process.c | 12 ------------
 mm/oom_kill.c          | 15 ++-------------
 4 files changed, 4 insertions(+), 27 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5991d41..4844325 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -98,7 +98,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index 84ae830..1b1dada 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 0c2ee97..df058be 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -146,18 +146,6 @@ int freeze_processes(void)
 	if (!error && !oom_killer_disable())
 		error = -EBUSY;
 
-	/*
-	 * There is a hard to fix race between oom_reaper kernel thread
-	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
-	 * before the victim reaches exit_mm so try to freeze all the tasks
-	 * again and catch such a left over task.
-	 */
-	if (!error) {
-		pr_info("Double checking all user space processes after OOM killer disable... ");
-		error = try_to_freeze_tasks(true);
-		pr_cont("\n");
-	}
-
 	if (error)
 		thaw_processes();
 	return error;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 55e0ffb..45e7de2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -582,15 +582,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	debug_show_all_locks();
 
 done:
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
-	 */
 	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
-
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 	/* Drop a reference taken above. */
@@ -600,8 +592,6 @@ done:
 
 static int oom_reaper(void *unused)
 {
-	set_freezable();
-
 	while (true) {
 		struct task_struct *tsk = NULL;
 
@@ -688,10 +678,9 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/8] mm,oom_reaper: Pass OOM victim's comm and pid values via mm_struct.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (5 preceding siblings ...)
  2016-07-03  2:40 ` [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
@ 2016-07-03  2:41 ` Tetsuo Handa
  2016-07-03  2:41 ` [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
  2016-07-07 11:04 ` [PATCH 0/8] Change OOM killer to " Michal Hocko
  8 siblings, 0 replies; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:41 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From 14cb84f6c1cd3a7ace58e52fea2fb52cb8e16f91 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 23:04:23 +0900
Subject: [PATCH 7/8] mm,oom_reaper: Pass OOM victim's comm and pid values via mm_struct.

In order to make the OOM reaper operate on mm_struct, pass the comm and
pid values used for printing the result of an OOM reap attempt via the
oom_mm structure embedded in mm_struct.

While it would be possible to add a pointer to task_struct to the oom_mm
struct, we don't want to hold a reference to the task_struct because that
OOM victim might exit soon. Thus, copy the comm and pid values at the
time mark_oom_victim() is called.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/mm_types.h |  4 ++++
 mm/oom_kill.c            | 20 ++++++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 718c0bd..3eabea9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -397,6 +397,10 @@ struct oom_mm {
 	struct list_head list; /* Linked to oom_mm_list list. */
 	struct mem_cgroup *memcg; /* No deref. Maybe NULL. */
 	const nodemask_t *nodemask; /* No deref. Maybe NULL. */
+#ifdef CONFIG_MMU
+	char comm[16]; /* Copy of task_struct->comm[TASK_COMM_LEN]. */
+	pid_t pid; /* Copy of task_struct->pid. */
+#endif
 };
 
 struct kioctx_table;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 45e7de2..317ce2c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -471,7 +471,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
+static bool __oom_reap_vmas(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
@@ -519,10 +519,10 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-			task_pid_nr(tsk), tsk->comm,
-			K(get_mm_counter(mm, MM_ANONPAGES)),
-			K(get_mm_counter(mm, MM_FILEPAGES)),
-			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+		mm->oom_mm.pid, mm->oom_mm.comm,
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
@@ -559,14 +559,14 @@ static void oom_reap_task(struct task_struct *tsk)
 	task_unlock(p);
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_vmas(mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
+		mm->oom_mm.pid, mm->oom_mm.comm);
 
 	/*
 	 * If we've already tried to reap this task in the past and
@@ -576,7 +576,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
 		pr_info("oom_reaper: giving up pid:%d (%s)\n",
-			task_pid_nr(tsk), tsk->comm);
+			mm->oom_mm.pid, mm->oom_mm.comm);
 		set_bit(MMF_OOM_REAPED, &mm->flags);
 	}
 	debug_show_all_locks();
@@ -662,6 +662,10 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 		atomic_inc(&mm->mm_count);
 		mm->oom_mm.memcg = oc->memcg;
 		mm->oom_mm.nodemask = oc->nodemask;
+#ifdef CONFIG_MMU
+		strncpy(mm->oom_mm.comm, tsk->comm, sizeof(mm->oom_mm.comm));
+		mm->oom_mm.pid = task_pid_nr(tsk);
+#endif
 		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
 	}
 	spin_unlock(&oom_mm_lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (6 preceding siblings ...)
  2016-07-03  2:41 ` [PATCH 7/8] mm,oom_reaper: Pass OOM victim's comm and pid values via mm_struct Tetsuo Handa
@ 2016-07-03  2:41 ` Tetsuo Handa
  2016-07-04 10:40   ` Oleg Nesterov
  2016-07-07 14:15   ` Michal Hocko
  2016-07-07 11:04 ` [PATCH 0/8] Change OOM killer to " Michal Hocko
  8 siblings, 2 replies; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03  2:41 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, oleg, rientjes, vdavydov, mst, mhocko, mhocko

From ce97a7f6b94a6003bc2b41e5f69da2fedc934a9d Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 2 Jul 2016 23:10:06 +0900
Subject: [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct.

Since OOM reaping is a per-mm_struct operation, it is natural to operate
on the list of mm_struct used by OOM victims.

This patch eliminates find_lock_task_mm() usage from the OOM reaper.

This patch fixes what commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE
after oom_reaper managed to unmap the address space") and
commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
queued for oom_reaper") tried to address, by always calling exit_oom_mm()
after an OOM reap attempt has been made.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h   |  8 -----
 include/linux/sched.h |  3 --
 mm/memcontrol.c       |  1 -
 mm/oom_kill.c         | 84 +++++++++++++++++----------------------------------
 4 files changed, 27 insertions(+), 69 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4844325..1a212c1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -71,14 +71,6 @@ static inline bool oom_task_origin(const struct task_struct *p)
 
 extern void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc);
 
-#ifdef CONFIG_MMU
-extern void wake_oom_reaper(struct task_struct *tsk);
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
-}
-#endif
-
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f472f27..4379279 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1916,9 +1916,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct task_struct *oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c17160d..9acc840 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1236,7 +1236,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current, &oc);
-		wake_oom_reaper(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 317ce2c..bdc192f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -468,8 +468,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
-static DEFINE_SPINLOCK(oom_reaper_lock);
 
 static bool __oom_reap_vmas(struct mm_struct *mm)
 {
@@ -540,30 +538,27 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
+static void oom_reap_vmas(struct mm_struct *mm)
 {
 	int attempts = 0;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p = find_lock_task_mm(tsk);
+	bool ret;
 
 	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
+	 * Check MMF_OOM_REAPED after holding oom_lock because
+	 * oom_kill_process() might find this mm pinned.
 	 */
-	if (!p)
-		goto done;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
+	mutex_lock(&oom_lock);
+	ret = test_bit(MMF_OOM_REAPED, &mm->flags);
+	mutex_unlock(&oom_lock);
+	if (ret)
+		return;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_vmas(mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES)
-		goto done;
+		return;
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		mm->oom_mm.pid, mm->oom_mm.comm);
@@ -580,51 +575,30 @@ static void oom_reap_task(struct task_struct *tsk)
 		set_bit(MMF_OOM_REAPED, &mm->flags);
 	}
 	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
-	/* Drop a reference taken above. */
-	if (mm)
-		mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
-
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
-		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		struct mm_struct *mm = NULL;
+
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_mm_list));
+		spin_lock(&oom_mm_lock);
+		if (!list_empty(&oom_mm_list))
+			mm = list_first_entry(&oom_mm_list, struct mm_struct,
+					      oom_mm.list);
+		spin_unlock(&oom_mm_lock);
+		if (!mm)
+			continue;
+		oom_reap_vmas(mm);
+		/* Drop a reference taken by mark_oom_victim(). */
+		exit_oom_mm(mm);
 	}
 
 	return 0;
 }
 
-void wake_oom_reaper(struct task_struct *tsk)
-{
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
-		return;
-
-	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
-	wake_up(&oom_reaper_wait);
-}
-
 static int __init oom_init(void)
 {
 	kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -667,6 +641,9 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 		mm->oom_mm.pid = task_pid_nr(tsk);
 #endif
 		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
+#ifdef CONFIG_MMU
+		wake_up(&oom_reaper_wait);
+#endif
 	}
 	spin_unlock(&oom_mm_lock);
 	/*
@@ -816,7 +793,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -825,7 +801,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_lock(p);
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p, oc);
-		wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -915,7 +890,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * memory might be still used. Hide the mm from the oom
 			 * killer to guarantee OOM forward progress.
 			 */
-			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
@@ -926,9 +900,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -1004,7 +975,6 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current, oc);
-		wake_oom_reaper(current);
 		return true;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
  2016-07-03  2:36 ` [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check Tetsuo Handa
@ 2016-07-03 12:42   ` Oleg Nesterov
  2016-07-03 16:03     ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-03 12:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

On 07/03, Tetsuo Handa wrote:
>
> If kthread_run() in oom_init() fails due to reasons other than OOM
> (e.g. no free pid is available), userspace processes won't be able to
> start as well.

Why?

The kernel will boot with or without your change, but

> Therefore, trying to continue with error message is
> also pointless.

Can't understand...

I think this warning makes sense. And since you removed the oom_reaper_th
check in wake_oom_reaper(), the kernel will leak every task_struct passed
to wake_oom_reaper()?

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
  2016-07-03 12:42   ` Oleg Nesterov
@ 2016-07-03 16:03     ` Tetsuo Handa
  2016-07-03 17:10       ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03 16:03 UTC (permalink / raw)
  To: oleg; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

Oleg Nesterov wrote:
> On 07/03, Tetsuo Handa wrote:
> >
> > If kthread_run() in oom_init() fails due to reasons other than OOM
> > (e.g. no free pid is available), userspace processes won't be able to
> > start as well.
> 
> Why?
> 
> The kernel will boot with or without your change, but
> 
> > Therefore, trying to continue with error message is
> > also pointless.
> 
> Can't understand...
> 
> I think this warning makes sense. And since you removed the oom_reaper_th
> check in wake_oom_reaper(), the kernel will leak every task_struct passed
> to wake_oom_reaper() ?

We are trying to prove that OOM livelock is impossible for CONFIG_MMU=y
kernels (as long as the OOM killer is invoked) because the OOM reaper
always gives feedback to the OOM killer, right? Then preserving code
which continues without the OOM reaper no longer makes sense.

In a past discussion, I suggested that Michal use BUG_ON() or panic()
( http://lkml.kernel.org/r/20151127123525.GG2493@dhcp22.suse.cz ). At that
time, we chose to continue with pr_err(). If you think that kthread_run()
failure in oom_init() can ever happen, I can change my patch to call
BUG_ON() or panic(). I don't like continuing without the OOM reaper.

Anyway, [PATCH 8/8] in this series removes get_task_struct().
Thus, the kernel won't leak every task_struct after all.

> 
> Oleg.
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
  2016-07-03 16:03     ` Tetsuo Handa
@ 2016-07-03 17:10       ` Oleg Nesterov
  2016-07-03 21:53         ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-03 17:10 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

On 07/04, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 07/03, Tetsuo Handa wrote:
> > >
> > > If kthread_run() in oom_init() fails due to reasons other than OOM
> > > (e.g. no free pid is available), userspace processes won't be able to
> > > start as well.
> >
> > Why?
> >
> > The kernel will boot with or without your change, but
> >
> > > Therefore, trying to continue with error message is
> > > also pointless.
> >
> > Can't understand...
> >
> > I think this warning makes sense. And since you removed the oom_reaper_th
> > check in wake_oom_reaper(), the kernel will leak every task_struct passed
> > to wake_oom_reaper() ?
>
> We are trying to prove that OOM livelock is impossible for CONFIG_MMU=y
> kernels (as long as OOM killer is invoked) because the OOM reaper always
> gives feedback to the OOM killer, right? Then, preserving code which
> continues without OOM reaper no longer makes sense.
>
> In the past discussion, I suggested Michal to use BUG_ON() or panic()
> ( http://lkml.kernel.org/r/20151127123525.GG2493@dhcp22.suse.cz ). At that
> time, we chose continue with pr_err(). If you think that kthread_run()
> failure in oom_init() will ever happen, I can change my patch to call
> BUG_ON() or panic(). I don't like continuing without OOM reaper.

And probably this makes sense, but

> Anyway, [PATCH 8/8] in this series removes get_task_struct().
> Thus, the kernel won't leak every task_struct after all.

which I can't read yet. I am still trying to clone linux-next; currently
my internet connection is very slow.

Anyway, this means that this 1/8 patch depends on 8/8, but 0/8 says

	[PATCH 1/8] can be sent to current linux.git as a clean up.

IOW, this patch doesn't look correct without other changes?

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
  2016-07-03 17:10       ` Oleg Nesterov
@ 2016-07-03 21:53         ` Tetsuo Handa
  2016-07-04 11:13           ` Oleg Nesterov
  2016-07-07 11:14           ` Michal Hocko
  0 siblings, 2 replies; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-03 21:53 UTC (permalink / raw)
  To: oleg; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

Oleg Nesterov wrote:
> On 07/04, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > On 07/03, Tetsuo Handa wrote:
> > > >
> > > > If kthread_run() in oom_init() fails due to reasons other than OOM
> > > > (e.g. no free pid is available), userspace processes won't be able to
> > > > start as well.
> > >
> > > Why?
> > >
> > > The kernel will boot with or without your change, but
> > >
> > > > Therefore, trying to continue with error message is
> > > > also pointless.
> > >
> > > Can't understand...
> > >
> > > I think this warning makes sense. And since you removed the oom_reaper_th
> > > check in wake_oom_reaper(), the kernel will leak every task_struct passed
> > > to wake_oom_reaper() ?
> >
> > We are trying to prove that OOM livelock is impossible for CONFIG_MMU=y
> > kernels (as long as OOM killer is invoked) because the OOM reaper always
> > gives feedback to the OOM killer, right? Then, preserving code which
> > continues without OOM reaper no longer makes sense.
> >
> > In the past discussion, I suggested Michal to use BUG_ON() or panic()
> > ( http://lkml.kernel.org/r/20151127123525.GG2493@dhcp22.suse.cz ). At that
> > time, we chose continue with pr_err(). If you think that kthread_run()
> > failure in oom_init() will ever happen, I can change my patch to call
> > BUG_ON() or panic(). I don't like continuing without OOM reaper.
> 
> And probably this makes sense, but
> 
> > Anyway, [PATCH 8/8] in this series removes get_task_struct().
> > Thus, the kernel won't leak every task_struct after all.
> 
> which I can't read yet. I am still trying to clone linux-net, currently
> my internet connection is very slow.
> 
> Anyway, this means that this 1/1 patch depends on 8/8, but 0/8 says
> 
> 	[PATCH 1/8] can be sent to current linux.git as a clean up.
> 
> IOW, this patch doesn't look correct without other changes?

Only if you think that global init can successfully start after
kthread_run() in oom_init() has failed.

Here is an updated patch.

> 
> Oleg.
> 
> 

From 977b0f4368a7ca07af7e519aa8795e7b2ee653d0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 4 Jul 2016 06:40:05 +0900
Subject: [PATCH 1/8] mm,oom_reaper: Don't boot without OOM reaper kernel thread.

We are trying to prove that OOM livelock is impossible for CONFIG_MMU=y
kernels (as long as the OOM killer is invoked) because the OOM reaper
always gives feedback to the OOM killer. Therefore, preserving code which
continues without the OOM reaper no longer makes sense.

Since oom_init() is called before OOM-killable userspace processes are
started, the system will panic if out_of_memory() is called before
oom_init() returns. Therefore, the oom_reaper_th == NULL check in
wake_oom_reaper() is pointless.

If kthread_run() in oom_init() fails for reasons other than
out_of_memory(), userspace processes won't be able to start either.
Therefore, trying to continue with an error message is also pointless.
But in case something unexpected occurs, let's explicitly add a
BUG_ON() check.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275..079ce96 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -447,7 +447,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
  */
-static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
@@ -629,9 +628,6 @@ static int oom_reaper(void *unused)
 
 void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
-		return;
-
 	/* tsk is already queued? */
 	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
 		return;
@@ -647,12 +643,9 @@ void wake_oom_reaper(struct task_struct *tsk)
 
 static int __init oom_init(void)
 {
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
-	if (IS_ERR(oom_reaper_th)) {
-		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
-				PTR_ERR(oom_reaper_th));
-		oom_reaper_th = NULL;
-	}
+	struct task_struct *p = kthread_run(oom_reaper, NULL, "oom_reaper");
+
+	BUG_ON(IS_ERR(p));
 	return 0;
 }
 subsys_initcall(oom_init)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-03  2:38 ` [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
@ 2016-07-04 10:39   ` Oleg Nesterov
  2016-07-04 12:50     ` Tetsuo Handa
  2016-07-07 11:31   ` Michal Hocko
  1 sibling, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-04 10:39 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

Tetsuo,

I'll try to actually read this series later, although I will leave the
actual review to maintainers anyway...

Just a couple of questions for now,

On 07/03, Tetsuo Handa wrote:
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -722,6 +722,7 @@ static inline void __mmput(struct mm_struct *mm)
>  	}
>  	if (mm->binfmt)
>  		module_put(mm->binfmt->module);
> +	exit_oom_mm(mm);

Is it strictly necessary? At first glance not. Sooner or later oom_reaper() should
find this mm_struct and do exit_oom_mm(). And given that mm->mm_users is already 0
the "extra" __oom_reap_vmas() doesn't really hurt.

It would be nice to remove exit_oom_mm() from __mmput(); it takes the global spinlock
for the very unlikely case, and if we can avoid it here then perhaps we can remove
->oom_mm from mm_struct.

Oleg.


* Re: [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-03  2:41 ` [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
@ 2016-07-04 10:40   ` Oleg Nesterov
  2016-07-07 14:15   ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-04 10:40 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

On 07/03, Tetsuo Handa wrote:
>
> +static void oom_reap_vmas(struct mm_struct *mm)
>  {
>  	int attempts = 0;
> -	struct mm_struct *mm = NULL;
> -	struct task_struct *p = find_lock_task_mm(tsk);
> +	bool ret;
>
>  	/*
> -	 * Make sure we find the associated mm_struct even when the particular
> -	 * thread has already terminated and cleared its mm.
> -	 * We might have race with exit path so consider our work done if there
> -	 * is no mm.
> +	 * Check MMF_OOM_REAPED after holding oom_lock because
> +	 * oom_kill_process() might find this mm pinned.
>  	 */
> -	if (!p)
> -		goto done;
> -	mm = p->mm;
> -	atomic_inc(&mm->mm_count);
> -	task_unlock(p);
> +	mutex_lock(&oom_lock);
> +	ret = test_bit(MMF_OOM_REAPED, &mm->flags);
> +	mutex_unlock(&oom_lock);

OK, but this looks strange...

Can't we instead move mark_oom_victim(victim) to the end of oom_kill_process(),
and change mark_oom_victim() to do nothing if MMF_OOM_REAPED is set? Or just
check this flag in oom_kill_process() before mark_oom_victim().
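
Something like this (completely untested, just to show the idea; the
details depend on the rest of your series):

	/* sketch: make mark_oom_victim() a no-op for an already reaped mm */
	void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
	{
		struct mm_struct *mm = tsk->mm;

		if (test_bit(MMF_OOM_REAPED, &mm->flags))
			return;
		/* ... the rest of mark_oom_victim() unchanged ... */
	}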

Oleg.


* Re: [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
  2016-07-03 21:53         ` Tetsuo Handa
@ 2016-07-04 11:13           ` Oleg Nesterov
  2016-07-07 11:14           ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-04 11:13 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

I guess we misunderstood each other,

On 07/04, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > On 07/04, Tetsuo Handa wrote:
> > >
> > > Oleg Nesterov wrote:
> > > > On 07/03, Tetsuo Handa wrote:
> > > > >
> > > > > If kthread_run() in oom_init() fails due to reasons other than OOM
> > > > > (e.g. no free pid is available), userspace processes won't be able to
> > > > > start as well.
> > > >
> > > > Why?
> > > >
> > > > The kernel will boot with or without your change, but

and yes, I probably confused you. I tried to say that in theory nothing
prevents the kernel from booting even if oom_init() fails.

> > IOW, this patch doesn't look correct without other changes?
>
> If you think that global init can successfully start after kthread_run()
> in oom_init() failed.

No, I think you are right. If kthread_run() fails at this stage then something
is seriously wrong anyway. And yes, for example hung_task_init() doesn't bother
to check the value returned by kthread_run().

I meant that this just looks wrong (and proc_dohung_task_timeout_secs() too).
And the warning can help to identify the problem.
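
E.g. in hung_task_init() something as small as this (untested sketch) would
already make the failure visible:

	t = kthread_run(watchdog, NULL, "khungtaskd");
	WARN_ON(IS_ERR(t));	/* not fatal, but don't fail silently */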

In any case I agree, we should remove "oom_reaper_th",

>  static int __init oom_init(void)
>  {
> -	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> -	if (IS_ERR(oom_reaper_th)) {
> -		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> -				PTR_ERR(oom_reaper_th));
> -		oom_reaper_th = NULL;
> -	}
> +	struct task_struct *p = kthread_run(oom_reaper, NULL, "oom_reaper");
> +
> +	BUG_ON(IS_ERR(p));
>  	return 0;

Yes, do_initcall_level() ignores the error returned by fn(), so we need BUG()
or panic().

This is off-topic, but perhaps we should audit all initcalls and fix those
that return an error for no reason, say, bts_init(). And then change
do_one_initcall() to panic or at least WARN() if a non-modular initcall fails.
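
Something along these lines in do_one_initcall(), completely untested and
glossing over the modular case:

	int ret = fn();

	/* built-in initcalls should not fail silently */
	WARN(ret && ret != -ENODEV, "initcall %pF returned %d\n", fn, ret);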

Oleg.


* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-04 10:39   ` Oleg Nesterov
@ 2016-07-04 12:50     ` Tetsuo Handa
  2016-07-04 18:25       ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-04 12:50 UTC (permalink / raw)
  To: oleg; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

Oleg Nesterov wrote:
> Tetsuo,
> 
> I'll try to actually read this series later, although I will leave the
> actual review to maintainers anyway...

Thank you.

> 
> Just a couple of questions for now,
> 
> On 07/03, Tetsuo Handa wrote:
> >
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -722,6 +722,7 @@ static inline void __mmput(struct mm_struct *mm)
> >  	}
> >  	if (mm->binfmt)
> >  		module_put(mm->binfmt->module);
> > +	exit_oom_mm(mm);
> 
> Is it strictly necessary? At first glance not. Sooner or later oom_reaper() should
> find this mm_struct and do exit_oom_mm(). And given that mm->mm_users is already 0
> the "extra" __oom_reap_vmas() doesn't really hurt.
> 
> It would be nice to remove exit_oom_mm() from __mmput(); it takes the global spinlock
> for the very unlikely case, and if we can avoid it here then perhaps we can remove
> ->oom_mm from mm_struct.

I changed the patch so that the global spinlock is not taken from __mmput()
unless that mm was used by TIF_MEMDIE threads. But I don't think I can remove
oom_mm from mm_struct, because oom_mm is used to control until when the OOM
killer should refrain from selecting the next OOM victim. My series is also
intended as a response to Michal's remark:

  We can also start thinking to use TIF_MEMDIE only for the access to memory
  reserves to oom victims which actually need to allocate and decouple the
  current double meaning.

> 
> Oleg.
> 
> 

I realized that

	if (nodemask && mm->nodemask != nodemask)
		continue;

in oom_has_pending_mm() is wrong. A nodemask_t is not a pointer that can be
compared by address; it needs to be compared with a bitmap operation.
Thus, I think I need to remember the task_struct which got TIF_MEMDIE. I updated
my series to remember the task_struct rather than the cgroup and nodemask_t, and to do

	if (oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
		continue;

instead. But I'm not sure whether this will work as expected, especially when
all threads in the thread group which mm->oom_mm.victim belongs to have reached
TASK_DEAD state. I guess that oom_unkillable_task() will then return true, that
mm will be selected by another thread group (which mm->oom_mm.victim does not
belong to), and mark_oom_victim() will update mm->oom_mm.victim.
I'd like to wait for Michal to come back...
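
(For the record, a content-based comparison would have looked roughly like
the line below, assuming struct oom_mm kept a copy of the constraint
nodemask; I dropped that approach instead.)

	if (nodemask && !nodes_intersects(mm->oom_mm.nodemask, *nodemask))
		continue;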

The updated patch below is based on top of linux-next-20160704 +
http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org .
Andrew, can you add "mmotm: mm-oom-fortify-task_will_free_mem-fix" to linux-next?

 include/linux/mm_types.h |    7 +
 include/linux/oom.h      |   16 --
 include/linux/sched.h    |    4
 kernel/exit.c            |    2
 kernel/fork.c            |    1
 kernel/power/process.c   |   12 -
 mm/memcontrol.c          |   16 --
 mm/oom_kill.c            |  298 ++++++++++++++++++++++-------------------------
 8 files changed, 163 insertions(+), 193 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d..7c1370a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -392,6 +392,12 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+struct oom_mm {
+	struct list_head list; /* Linked to oom_mm_list list. */
+	/* Thread which was passed to mark_oom_victim() for the last time. */
+	struct task_struct *victim;
+};
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
@@ -515,6 +521,7 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+	struct oom_mm oom_mm;
 #ifdef CONFIG_MMU
 	struct work_struct async_put_work;
 #endif
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457..1a212c1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,7 +49,6 @@ enum oom_constraint {
 enum oom_scan_t {
 	OOM_SCAN_OK,		/* scan thread and find its badness */
 	OOM_SCAN_CONTINUE,	/* do not consider thread for oom kill */
-	OOM_SCAN_ABORT,		/* abort the iteration and return */
 	OOM_SCAN_SELECT,	/* always select this thread first */
 };
 
@@ -70,15 +69,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
-
-#ifdef CONFIG_MMU
-extern void wake_oom_reaper(struct task_struct *tsk);
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
-}
-#endif
+extern void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -91,12 +82,15 @@ extern void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 extern void check_panic_on_oom(struct oom_control *oc,
 			       enum oom_constraint constraint);
 
+extern void exit_oom_mm(struct mm_struct *mm);
+extern bool oom_has_pending_mm(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask);
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 					       struct task_struct *task);
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..4379279 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -671,7 +671,6 @@ struct signal_struct {
 	atomic_t		sigcnt;
 	atomic_t		live;
 	int			nr_threads;
-	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
 	struct list_head	thread_head;
 
 	wait_queue_head_t	wait_chldexit;	/* for wait4() */
@@ -1917,9 +1916,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct task_struct *oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/kernel/exit.c b/kernel/exit.c
index 84ae830..1b1dada 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/kernel/fork.c b/kernel/fork.c
index 7926993..b870dbc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -722,6 +722,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	exit_oom_mm(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 0c2ee97..df058be 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -146,18 +146,6 @@ int freeze_processes(void)
 	if (!error && !oom_killer_disable())
 		error = -EBUSY;
 
-	/*
-	 * There is a hard to fix race between oom_reaper kernel thread
-	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
-	 * before the victim reaches exit_mm so try to freeze all the tasks
-	 * again and catch such a left over task.
-	 */
-	if (!error) {
-		pr_info("Double checking all user space processes after OOM killer disable... ");
-		error = try_to_freeze_tasks(true);
-		pr_cont("\n");
-	}
-
 	if (error)
 		thaw_processes();
 	return error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40dfca3..9acc840 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1235,12 +1235,16 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		mark_oom_victim(current, &oc);
 		goto unlock;
 	}
 
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG);
+	if (oom_has_pending_mm(memcg, NULL)) {
+		/* Set a dummy value to return "true". */
+		chosen = (void *) 1;
+		goto unlock;
+	}
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
@@ -1258,14 +1262,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				/* fall through */
 			case OOM_SCAN_CONTINUE:
 				continue;
-			case OOM_SCAN_ABORT:
-				css_task_iter_end(&it);
-				mem_cgroup_iter_break(memcg, iter);
-				if (chosen)
-					put_task_struct(chosen);
-				/* Set a dummy value to return "true". */
-				chosen = (void *) 1;
-				goto unlock;
 			case OOM_SCAN_OK:
 				break;
 			};
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275..f60ed04 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -282,25 +282,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		return OOM_SCAN_CONTINUE;
 
 	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_REAPED because chances that it would release
-	 * any memory is quite low.
-	 */
-	if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) {
-		struct task_struct *p = find_lock_task_mm(task);
-		enum oom_scan_t ret = OOM_SCAN_ABORT;
-
-		if (p) {
-			if (test_bit(MMF_OOM_REAPED, &p->mm->flags))
-				ret = OOM_SCAN_CONTINUE;
-			task_unlock(p);
-		}
-
-		return ret;
-	}
-
-	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
 	 */
@@ -310,6 +291,57 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	return OOM_SCAN_OK;
 }
 
+static LIST_HEAD(oom_mm_list);
+static DEFINE_SPINLOCK(oom_mm_lock);
+
+static inline void __exit_oom_mm(struct mm_struct *mm)
+{
+	struct task_struct *tsk;
+
+	spin_lock(&oom_mm_lock);
+	list_del(&mm->oom_mm.list);
+	tsk = mm->oom_mm.victim;
+	mm->oom_mm.victim = NULL;
+	spin_unlock(&oom_mm_lock);
+	/* Drop references taken by mark_oom_victim() */
+	put_task_struct(tsk);
+	mmdrop(mm);
+}
+
+void exit_oom_mm(struct mm_struct *mm)
+{
+	/* Nothing to do unless mark_oom_victim() was called with this mm. */
+	if (!mm->oom_mm.victim)
+		return;
+#ifdef CONFIG_MMU
+	/*
+	 * OOM reaper will eventually call __exit_oom_mm().
+	 * Allow oom_has_pending_mm() to ignore this mm.
+	 */
+	set_bit(MMF_OOM_REAPED, &mm->flags);
+#else
+	__exit_oom_mm(mm);
+#endif
+}
+
+bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
+{
+	struct mm_struct *mm;
+	bool ret = false;
+
+	spin_lock(&oom_mm_lock);
+	list_for_each_entry(mm, &oom_mm_list, oom_mm.list) {
+		if (oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
+			continue;
+		if (test_bit(MMF_OOM_REAPED, &mm->flags))
+			continue;
+		ret = true;
+		break;
+	}
+	spin_unlock(&oom_mm_lock);
+	return ret;
+}
+
 /*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'.  Returns -1 on scan abort.
@@ -332,9 +364,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
 			/* fall through */
 		case OOM_SCAN_CONTINUE:
 			continue;
-		case OOM_SCAN_ABORT:
-			rcu_read_unlock();
-			return (struct task_struct *)(-1UL);
 		case OOM_SCAN_OK:
 			break;
 		};
@@ -447,54 +476,17 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
  */
-static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
-static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = NULL;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
-	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		goto unlock_oom;
-	mm = p->mm;
-	atomic_inc(&mm->mm_count);
-	task_unlock(p);
-
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto mm_drop;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * increase mm_users only after we know we will reap something so
@@ -503,7 +495,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	if (!mmget_not_zero(mm)) {
 		up_read(&mm->mmap_sem);
-		goto mm_drop;
+		return true;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -551,108 +543,85 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * put the oom_reaper out of the way.
 	 */
 	mmput_async(mm);
-mm_drop:
-	mmdrop(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
+static void oom_reap_task(struct mm_struct *mm, struct task_struct *tsk)
 {
 	int attempts = 0;
+	bool ret;
+
+	/*
+	 * Check MMF_OOM_REAPED after holding oom_lock because
+	 * oom_kill_process() might find this mm pinned.
+	 */
+	mutex_lock(&oom_lock);
+	ret = test_bit(MMF_OOM_REAPED, &mm->flags);
+	mutex_unlock(&oom_lock);
+	if (ret)
+		return;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		struct task_struct *p;
-
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
-
-		/*
-		 * If we've already tried to reap this task in the past and
-		 * failed it probably doesn't make much sense to try yet again
-		 * so hide the mm from the oom killer so that it can move on
-		 * to another task with a different mm struct.
-		 */
-		p = find_lock_task_mm(tsk);
-		if (p) {
-			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
-				pr_info("oom_reaper: giving up pid:%d (%s)\n",
-						task_pid_nr(tsk), tsk->comm);
-				set_bit(MMF_OOM_REAPED, &p->mm->flags);
-			}
-			task_unlock(p);
-		}
+	if (attempts <= MAX_OOM_REAP_RETRIES)
+		return;
 
-		debug_show_all_locks();
-	}
+	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+		task_pid_nr(tsk), tsk->comm);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore or it is not a good candidate
-	 * for the oom victim right now because it cannot release its memory
-	 * itself nor by the oom reaper.
+	 * If we've already tried to reap this task in the past and
+	 * failed it probably doesn't make much sense to try yet again
+	 * so hide the mm from the oom killer so that it can move on
+	 * to another task with a different mm struct.
 	 */
-	tsk->oom_reaper_list = NULL;
-	exit_oom_victim(tsk);
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
+	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
+		pr_info("oom_reaper: giving up pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		set_bit(MMF_OOM_REAPED, &mm->flags);
+	}
+	debug_show_all_locks();
 }
 
 static int oom_reaper(void *unused)
 {
-	set_freezable();
-
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct mm_struct *mm = NULL;
+		struct task_struct *victim = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_mm_list));
+		/*
+		 * Take a reference on current victim thread in case
+		 * oom_reap_task() raced with mark_oom_victim() by
+		 * other threads sharing this mm.
+		 */
+		spin_lock(&oom_mm_lock);
+		if (!list_empty(&oom_mm_list)) {
+			mm = list_first_entry(&oom_mm_list, struct mm_struct,
+					      oom_mm.list);
+			victim = mm->oom_mm.victim;
+			get_task_struct(victim);
 		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		spin_unlock(&oom_mm_lock);
+		if (!mm)
+			continue;
+		oom_reap_task(mm, victim);
+		put_task_struct(victim);
+		__exit_oom_mm(mm);
 	}
 
 	return 0;
 }
 
-void wake_oom_reaper(struct task_struct *tsk)
-{
-	if (!oom_reaper_th)
-		return;
-
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
-		return;
-
-	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
-	wake_up(&oom_reaper_wait);
-}
-
 static int __init oom_init(void)
 {
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
-	if (IS_ERR(oom_reaper_th)) {
-		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
-				PTR_ERR(oom_reaper_th));
-		oom_reaper_th = NULL;
-	}
+	struct task_struct *p = kthread_run(oom_reaper, NULL, "oom_reaper");
+
+	BUG_ON(IS_ERR(p));
 	return 0;
 }
 subsys_initcall(oom_init)
@@ -661,17 +630,39 @@ subsys_initcall(oom_init)
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
+ * @oc: oom_control
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
  */
-void mark_oom_victim(struct task_struct *tsk)
+void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 {
+	struct mm_struct *mm = tsk->mm;
+	struct task_struct *old_tsk;
+
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
-	atomic_inc(&tsk->signal->oom_victims);
+	/*
+	 * Since mark_oom_victim() is called from multiple threads,
+	 * connect this mm to oom_mm_list only if not yet connected.
+	 *
+	 * Since mark_oom_victim() is called with a stable mm (i.e.
+	 * mm->mm_users > 0), __exit_oom_mm() from __mmput() can't be called
+	 * before we add this mm to the list.
+	 */
+	spin_lock(&oom_mm_lock);
+	old_tsk = mm->oom_mm.victim;
+	get_task_struct(tsk);
+	mm->oom_mm.victim = tsk;
+	if (!old_tsk) {
+		atomic_inc(&mm->mm_count);
+		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
+	}
+	spin_unlock(&oom_mm_lock);
+	if (old_tsk)
+		put_task_struct(old_tsk);
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -680,16 +671,17 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+#ifdef CONFIG_MMU
+	wake_up(&oom_reaper_wait);
+#endif
 }
 
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
-	atomic_dec(&tsk->signal->oom_victims);
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -821,7 +813,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -829,8 +820,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
+		mark_oom_victim(p, oc);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -890,7 +880,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
+	mark_oom_victim(victim, oc);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -920,7 +910,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * memory might be still used. Hide the mm from the oom
 			 * killer to guarantee OOM forward progress.
 			 */
-			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
@@ -931,9 +920,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -1008,8 +994,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		mark_oom_victim(current, oc);
 		return true;
 	}
 
@@ -1040,13 +1025,16 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (!is_sysrq_oom(oc) && oom_has_pending_mm(oc->memcg, oc->nodemask))
+		return true;
+
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (p && p != (void *)-1UL) {
+	if (p) {
 		oom_kill_process(oc, p, points, totalpages, "Out of memory");
 		/*
 		 * Give the killed process a good chance to exit before trying


* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-04 12:50     ` Tetsuo Handa
@ 2016-07-04 18:25       ` Oleg Nesterov
  2016-07-05 10:43         ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-04 18:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

On 07/04, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > >
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -722,6 +722,7 @@ static inline void __mmput(struct mm_struct *mm)
> > >  	}
> > >  	if (mm->binfmt)
> > >  		module_put(mm->binfmt->module);
> > > +	exit_oom_mm(mm);
> >
> > Is it strictly necessary? At first glance not. Sooner or later oom_reaper() should
> > find this mm_struct and do exit_oom_mm(). And given that mm->mm_users is already 0
> > the "extra" __oom_reap_vmas() doesn't really hurt.
> >
> > It would be nice to remove exit_oom_mm() from __mmput(); it takes the global spinlock
> > for the very unlikely case, and if we can avoid it here then perhaps we can remove
> > ->oom_mm from mm_struct.
>
> I changed not to take global spinlock from __mmput() unless that mm was used by
> TIF_MEMDIE threads.

This new version doesn't apply on top of 2/8, I can't really understand it...

> But I don't think I can remove oom_mm from mm_struct

I think we can try later. oom_init() can create a small mem-pool, or for a start
we can even use GFP_ATOMIC to (try to) alloc

	struct oom_mm {
		struct mm_struct *mm;	// mm to reap
		struct list_head list;	// node in the oom_mm_list
		...
	};

Let's discuss this later, but to do this we need to remove exit_oom_mm() from exit_mm().
We could probably add another MMF_ flag instead, but I think it would be nice to avoid
exit_oom_mm() anyway, if it is possible.
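
Just to illustrate (untested sketch, function name made up, locking and
error handling aside), the queueing side could then look like:

	static bool queue_oom_mm(struct mm_struct *mm)
	{
		struct oom_mm *node = kmalloc(sizeof(*node), GFP_ATOMIC);

		if (!node)
			return false;	/* caller can fall back to the old path */
		atomic_inc(&mm->mm_count);	/* pin the mm_struct for the oom reaper */
		node->mm = mm;
		list_add_tail(&node->list, &oom_mm_list);
		return true;
	}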

And in any case personally I really hate oom_mm->comm/pid ;) but I think we
can remove it later either way.

> Thus, I think I need to remember task_struct which got TIF_MEMDIE.

Well, this is unfortunate imho. This in fact turns "reap mm" back into "reap task".

> I'd like to wait for Michal to come back...

Yes. IMHO this series doesn't look bad, but let's wait for Michal.

> +void exit_oom_mm(struct mm_struct *mm)
> +{
> +	/* Nothing to do unless mark_oom_victim() was called with this mm. */
> +	if (!mm->oom_mm.victim)
> +		return;
> +#ifdef CONFIG_MMU
> +	/*
> +	 * OOM reaper will eventually call __exit_oom_mm().
> +	 * Allow oom_has_pending_mm() to ignore this mm.
> +	 */
> +	set_bit(MMF_OOM_REAPED, &mm->flags);

If the caller is exit_mm(), then mm->mm_users == 0 and oom_has_pending_mm()
can check it is zero instead?

So,

> +#else
> +	__exit_oom_mm(mm);
> +#endif

it seems that only CONFIG_MMU=n needs this... Apart from oom_has_pending_mm()
why do we bother to add the victim's mm to oom_mm_list?

Oleg.


* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-04 18:25       ` Oleg Nesterov
@ 2016-07-05 10:43         ` Tetsuo Handa
  2016-07-05 20:52           ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-05 10:43 UTC (permalink / raw)
  To: oleg; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

Oleg Nesterov wrote:
> On 07/04, Tetsuo Handa wrote:
> >
> > Oleg Nesterov wrote:
> > > >
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -722,6 +722,7 @@ static inline void __mmput(struct mm_struct *mm)
> > > >  	}
> > > >  	if (mm->binfmt)
> > > >  		module_put(mm->binfmt->module);
> > > > +	exit_oom_mm(mm);
> > >
> > > Is it strictly necessary? At first glance not. Sooner or later oom_reaper() should
> > > find this mm_struct and do exit_oom_mm(). And given that mm->mm_users is already 0
> > > the "extra" __oom_reap_vmas() doesn't really hurt.
> > >
> > > It would be nice to remove exit_oom_mm() from __mmput(); it takes the global spinlock
> > > for the very unlikely case, and if we can avoid it here then perhaps we can remove
> > > ->oom_mm from mm_struct.
> >
> > I changed not to take global spinlock from __mmput() unless that mm was used by
> > TIF_MEMDIE threads.
> 
> This new version doesn't apply on top of 2/8, I can't really understand it...

This new version is not meant to apply on top of 2/8; it is a squash of all the [1-8]/8 patches.

> > +void exit_oom_mm(struct mm_struct *mm)
> > +{
> > +	/* Nothing to do unless mark_oom_victim() was called with this mm. */
> > +	if (!mm->oom_mm.victim)
> > +		return;
> > +#ifdef CONFIG_MMU
> > +	/*
> > +	 * OOM reaper will eventually call __exit_oom_mm().
> > +	 * Allow oom_has_pending_mm() to ignore this mm.
> > +	 */
> > +	set_bit(MMF_OOM_REAPED, &mm->flags);
> 
> If the caller is exit_mm(), then mm->mm_users == 0 and oom_has_pending_mm()
> can check it is zero instead?

I don't think so. Setting MMF_OOM_REAPED indicates that memory used by that
mm is already reclaimed by the OOM reaper or by __mmput(). mm->mm_users == 0
alone does not mean memory used by that mm is already reclaimed.

Making exit_oom_mm() a no-op for CONFIG_MMU=y would be OK, but maybe we don't
want to defer the next OOM victim selection in a domain D2 which reached this
point, just because D2 has to wait for the OOM reaper to finish processing a
domain D1 whose OOM event occurred before the OOM event in D2.
If the OOM reaper tried D2 before retrying D1 for MAX_OOM_REAP_RETRIES times,
__exit_oom_mm() for D2 would be called immediately.



By the way, if this new version works as expected and we don't want oom_mm_lock
spinlock, we can reuse oom_lock like below.

 oom_kill.c |   39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bbd3138..f23b306 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -292,17 +292,16 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 }
 
 static LIST_HEAD(oom_mm_list);
-static DEFINE_SPINLOCK(oom_mm_lock);
 
 static inline void __exit_oom_mm(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 
-	spin_lock(&oom_mm_lock);
+	mutex_lock(&oom_lock);
 	list_del(&mm->oom_mm.list);
 	tsk = mm->oom_mm.victim;
 	mm->oom_mm.victim = NULL;
-	spin_unlock(&oom_mm_lock);
+	mutex_unlock(&oom_lock);
 	/* Drop references taken by mark_oom_victim() */
 	put_task_struct(tsk);
 	mmdrop(mm);
@@ -329,7 +328,6 @@ bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 	struct mm_struct *mm;
 	bool ret = false;
 
-	spin_lock(&oom_mm_lock);
 	list_for_each_entry(mm, &oom_mm_list, oom_mm.list) {
 		if (oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
 			continue;
@@ -338,7 +336,6 @@ bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 		ret = true;
 		break;
 	}
-	spin_unlock(&oom_mm_lock);
 	return ret;
 }
 
@@ -550,16 +547,12 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 static void oom_reap_task(struct mm_struct *mm, struct task_struct *tsk)
 {
 	int attempts = 0;
-	bool ret;
 
 	/*
 	 * Check MMF_OOM_REAPED after holding oom_lock because
 	 * oom_kill_process() might find this mm pinned.
 	 */
-	mutex_lock(&oom_lock);
-	ret = test_bit(MMF_OOM_REAPED, &mm->flags);
-	mutex_unlock(&oom_lock);
-	if (ret)
+	if (test_bit(MMF_OOM_REAPED, &mm->flags))
 		return;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
@@ -589,8 +582,8 @@ static void oom_reap_task(struct mm_struct *mm, struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct mm_struct *mm = NULL;
-		struct task_struct *victim = NULL;
+		struct mm_struct *mm;
+		struct task_struct *victim;
 
 		wait_event_freezable(oom_reaper_wait,
 				     !list_empty(&oom_mm_list));
@@ -599,16 +592,12 @@ static int oom_reaper(void *unused)
 		 * oom_reap_task() raced with mark_oom_victim() by
 		 * other threads sharing this mm.
 		 */
-		spin_lock(&oom_mm_lock);
-		if (!list_empty(&oom_mm_list)) {
-			mm = list_first_entry(&oom_mm_list, struct mm_struct,
-					      oom_mm.list);
-			victim = mm->oom_mm.victim;
-			get_task_struct(victim);
-		}
-		spin_unlock(&oom_mm_lock);
-		if (!mm)
-			continue;
+		mutex_lock(&oom_lock);
+		mm = list_first_entry(&oom_mm_list, struct mm_struct,
+				      oom_mm.list);
+		victim = mm->oom_mm.victim;
+		get_task_struct(victim);
+		mutex_unlock(&oom_lock);
 		oom_reap_task(mm, victim);
 		put_task_struct(victim);
 		__exit_oom_mm(mm);
@@ -652,17 +641,15 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
 	 * mm->mm_users > 0), __exit_oom_mm() from __mmput() can't be called
 	 * before we add this mm to the list.
 	 */
-	spin_lock(&oom_mm_lock);
 	old_tsk = mm->oom_mm.victim;
 	get_task_struct(tsk);
 	mm->oom_mm.victim = tsk;
 	if (!old_tsk) {
 		atomic_inc(&mm->mm_count);
 		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
-	}
-	spin_unlock(&oom_mm_lock);
-	if (old_tsk)
+	} else {
 		put_task_struct(old_tsk);
+	}
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free


* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-05 10:43         ` Tetsuo Handa
@ 2016-07-05 20:52           ` Oleg Nesterov
  2016-07-06  8:53             ` Oleg Nesterov
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-05 20:52 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

Tetsuo,

I am already sleeping and had a lot of beer ;) but let me try to (partly) reply anyway.

On 07/05, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> >
> > This new version doesn't apply on top of 2/8, I can't really understand it...
>
> This new version is not for on top of 2/8, but squashed of all [1-8]/8 patches.

Ah, OK, I'll try to take a look tomorrow.

> > > +void exit_oom_mm(struct mm_struct *mm)
> > > +{
> > > +	/* Nothing to do unless mark_oom_victim() was called with this mm. */
> > > +	if (!mm->oom_mm.victim)
> > > +		return;
> > > +#ifdef CONFIG_MMU
> > > +	/*
> > > +	 * OOM reaper will eventually call __exit_oom_mm().
> > > +	 * Allow oom_has_pending_mm() to ignore this mm.
> > > +	 */
> > > +	set_bit(MMF_OOM_REAPED, &mm->flags);
> >
> > If the caller is exit_mm(), then mm->mm_users == 0 and oom_has_pending_mm()
> > can check it is zero instead?
>
> I don't think so. Setting MMF_OOM_REAPED indicates that memory used by that
> mm is already reclaimed by the OOM reaper or by __mmput().

Sure, this is clear,

> mm->mm_users == 0
> alone does not mean memory used by that mm is already reclaimed.
  ^^^^^

Of course! I meant that oom_has_pending_mm() can check _both_ mm_users and
MMF_OOM_REAPED, and then we do not need to set MMF_OOM_REAPED in the exit_mm() path.

No?
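
IOW, something like this in oom_has_pending_mm(), completely untested:

	list_for_each_entry(mm, &oom_mm_list, oom_mm.list) {
		/* every user already passed through exit_mm() */
		if (!atomic_read(&mm->mm_users))
			continue;
		if (test_bit(MMF_OOM_REAPED, &mm->flags))
			continue;
		if (oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
			continue;
		return true;
	}
	return false;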

> Making exit_oom_mm() a no-op for CONFIG_MMU=y would be OK,

Yes. Not only because this can simplify other changes. I do believe that the fewer
"oom" hooks we have the better, even if this needs some complications in oom_kill.c.

For example, this series removes the extra try_to_freeze_tasks() from freeze_processes()
(which is in fact the "oom" hook) and personally I do like this fact.

And. Of course I am not sure this is possible, but to me it would be very nice
to kill oom_reaper_list altogether if CONFIG_MMU=n.

Oleg.


* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-05 20:52           ` Oleg Nesterov
@ 2016-07-06  8:53             ` Oleg Nesterov
  2016-07-06 11:43               ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Oleg Nesterov @ 2016-07-06  8:53 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

On 07/05, Oleg Nesterov wrote:
>
> > I don't think so. Setting MMF_OOM_REAPED indicates that memory used by that
> > mm is already reclaimed by the OOM reaper or by __mmput().
>
> Sure, this is clear,
>
> > mm->mm_users == 0
> > alone does not mean memory used by that mm is already reclaimed.
>   ^^^^^
>
> Of course! I meant that oom_has_pending_mm() can check _both_ mm_users and
> MMF_OOM_REAPED and then we do not need to set MMF_OOM_REAPED in exit_mm() path.
>
> No?

OK, perhaps you meant that mm_users == 0 can't help because __mmput() can block
after that and thus we should not assume this memory is already reclaimed...

So yes, this probably needs more thinking. Perhaps we can check mm->vma == NULL.

>
> > Making exit_oom_mm() a no-op for CONFIG_MMU=y would be OK,
>
> Yes. Not only because this can simplify other changes. I do believe that the less
> "oom" hooks we have the better, even if this needs some complications in oom_kill.c.
>
> For example, this series removes the extra try_to_freeze_tasks() from freeze_processes()
> (which is in fact the "oom" hook) and personally I do like this fact.
>
> And. Of course I am not sure this is possible, but to me it would be very nice
> to kill oom_reaper_list altogether if CONFIG_MMU=n.
>
> Oleg.


* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-06  8:53             ` Oleg Nesterov
@ 2016-07-06 11:43               ` Tetsuo Handa
  0 siblings, 0 replies; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-06 11:43 UTC (permalink / raw)
  To: oleg; +Cc: linux-mm, akpm, rientjes, vdavydov, mst, mhocko, mhocko

Oleg Nesterov wrote:
> On 07/05, Oleg Nesterov wrote:
> >
> > > I don't think so. Setting MMF_OOM_REAPED indicates that memory used by that
> > > mm is already reclaimed by the OOM reaper or by __mmput().
> >
> > Sure, this is clear,
> >
> > > mm->mm_users == 0
> > > alone does not mean memory used by that mm is already reclaimed.
> >   ^^^^^
> >
> > Of course! I meant that oom_has_pending_mm() can check _both_ mm_users and
> > MMF_OOM_REAPED and then we do not need to set MMF_OOM_REAPED in exit_mm() path.
> >
> > No?
> 
> OK, perhaps you meant that mm_users == 0 can't help because __mmput() can block
> after that and thus we should not assume this memory is already reclaimed...

Right.

> 
> So yes this probably needs more thinking. perhaps we can check mm->vma == NULL.
> 

The patch below is an example of removing exit_oom_mm() from __mmput().

> >
> > > Making exit_oom_mm() a no-op for CONFIG_MMU=y would be OK,
> >
> > Yes. Not only because this can simplify other changes. I do believe that the less
> > "oom" hooks we have the better, even if this needs some complications in oom_kill.c.
> >
> > For example, this series removes the extra try_to_freeze_tasks() from freeze_processes()
> > (which is in fact the "oom" hook) and personally I do like this fact.
> >
> > And. Of course I am not sure this is possible, but to me it would be very nice
> > to kill oom_reaper_list altogether if CONFIG_MMU=n.

If CONFIG_NUMA and CONFIG_CGROUP depend on CONFIG_MMU, oom_has_pending_mm() for
CONFIG_MMU=n will become as simple as

	bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
	{
		return atomic_read(&oom_victims);
	}

and we can remove oom_mm_list altogether. But it seems to me that CONFIG_CGROUP=y
with CONFIG_MMU=n is possible. I think calling exit_oom_mm() from __mmput() in
order to drop the references immediately is better for the CONFIG_MMU=n case.
(Or is Michal's signal->oom_mm approach better?)


 include/linux/oom.h |   3 +-
 kernel/fork.c       |   1 -
 mm/memcontrol.c     |   2 +-
 mm/oom_kill.c       | 117 ++++++++++++++++++++--------------------------------
 4 files changed, 47 insertions(+), 76 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1a212c1..72a21a4 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -69,7 +69,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc);
+extern void mark_oom_victim(struct task_struct *tsk);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -82,7 +82,6 @@ extern void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 extern void check_panic_on_oom(struct oom_control *oc,
 			       enum oom_constraint constraint);
 
-extern void exit_oom_mm(struct mm_struct *mm);
 extern bool oom_has_pending_mm(struct mem_cgroup *memcg,
 			       const nodemask_t *nodemask);
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
diff --git a/kernel/fork.c b/kernel/fork.c
index b870dbc..7926993 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -722,7 +722,6 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
-	exit_oom_mm(mm);
 	mmdrop(mm);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9acc840..6afe1c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1235,7 +1235,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current, &oc);
+		mark_oom_victim(current);
 		goto unlock;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bbd3138..f2b5ec2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -292,54 +292,40 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 }
 
 static LIST_HEAD(oom_mm_list);
-static DEFINE_SPINLOCK(oom_mm_lock);
 
-static inline void __exit_oom_mm(struct mm_struct *mm)
+static void exit_oom_mm(struct mm_struct *mm)
 {
-	struct task_struct *tsk;
-
-	spin_lock(&oom_mm_lock);
+	/* Drop references taken by mark_oom_victim() */
 	list_del(&mm->oom_mm.list);
-	tsk = mm->oom_mm.victim;
+	put_task_struct(mm->oom_mm.victim);
 	mm->oom_mm.victim = NULL;
-	spin_unlock(&oom_mm_lock);
-	/* Drop references taken by mark_oom_victim() */
-	put_task_struct(tsk);
 	mmdrop(mm);
 }
 
-void exit_oom_mm(struct mm_struct *mm)
-{
-	/* Nothing to do unless mark_oom_victim() was called with this mm. */
-	if (!mm->oom_mm.victim)
-		return;
-#ifdef CONFIG_MMU
-	/*
-	 * OOM reaper will eventually call __exit_oom_mm().
-	 * Allow oom_has_pending_mm() to ignore this mm.
-	 */
-	set_bit(MMF_OOM_REAPED, &mm->flags);
-#else
-	__exit_oom_mm(mm);
-#endif
-}
-
 bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 {
 	struct mm_struct *mm;
-	bool ret = false;
+	struct mm_struct *tmp;
 
-	spin_lock(&oom_mm_lock);
-	list_for_each_entry(mm, &oom_mm_list, oom_mm.list) {
-		if (oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
+	list_for_each_entry_safe(mm, tmp, &oom_mm_list, oom_mm.list) {
+		/* Was set_mm_exe_file(mm, NULL) called from __mmput(mm) ? */
+		if (!rcu_dereference_raw(mm->exe_file)) {
+#ifndef CONFIG_MMU
+			/*
+			 * Note that a reference on mm and mm->oom_mm.victim
+			 * will remain until this function is called for the
+			 * next time after set_mm_exe_file(mm, NULL) was
+			 * called, for OOM reaper callback is not available.
+			 */
+			exit_oom_mm(mm);
+#endif
 			continue;
-		if (test_bit(MMF_OOM_REAPED, &mm->flags))
+		}
+		if (oom_unkillable_task(mm->oom_mm.victim, memcg, nodemask))
 			continue;
-		ret = true;
-		break;
+		return true;
 	}
-	spin_unlock(&oom_mm_lock);
-	return ret;
+	return false;
 }
 
 /*
@@ -550,16 +536,12 @@ static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
 static void oom_reap_task(struct mm_struct *mm, struct task_struct *tsk)
 {
 	int attempts = 0;
-	bool ret;
 
 	/*
-	 * Check MMF_OOM_REAPED after holding oom_lock because
-	 * oom_kill_process() might find this mm pinned.
+	 * Check MMF_OOM_REAPED in case oom_kill_process() found this mm
+	 * pinned.
 	 */
-	mutex_lock(&oom_lock);
-	ret = test_bit(MMF_OOM_REAPED, &mm->flags);
-	mutex_unlock(&oom_lock);
-	if (ret)
+	if (test_bit(MMF_OOM_REAPED, &mm->flags))
 		return;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
@@ -589,8 +571,8 @@ static void oom_reap_task(struct mm_struct *mm, struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct mm_struct *mm = NULL;
-		struct task_struct *victim = NULL;
+		struct mm_struct *mm;
+		struct task_struct *victim;
 
 		wait_event_freezable(oom_reaper_wait,
 				     !list_empty(&oom_mm_list));
@@ -599,19 +581,17 @@ static int oom_reaper(void *unused)
 		 * oom_reap_task() raced with mark_oom_victim() by
 		 * other threads sharing this mm.
 		 */
-		spin_lock(&oom_mm_lock);
-		if (!list_empty(&oom_mm_list)) {
-			mm = list_first_entry(&oom_mm_list, struct mm_struct,
-					      oom_mm.list);
-			victim = mm->oom_mm.victim;
-			get_task_struct(victim);
-		}
-		spin_unlock(&oom_mm_lock);
-		if (!mm)
-			continue;
+		mutex_lock(&oom_lock);
+		mm = list_first_entry(&oom_mm_list, struct mm_struct,
+				      oom_mm.list);
+		victim = mm->oom_mm.victim;
+		get_task_struct(victim);
+		mutex_unlock(&oom_lock);
 		oom_reap_task(mm, victim);
 		put_task_struct(victim);
-		__exit_oom_mm(mm);
+		mutex_lock(&oom_lock);
+		exit_oom_mm(mm);
+		mutex_unlock(&oom_lock);
 	}
 
 	return 0;
@@ -630,39 +610,32 @@ subsys_initcall(oom_init)
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
- * @oc: oom_control
  *
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
  */
-void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
+void mark_oom_victim(struct task_struct *tsk)
 {
 	struct mm_struct *mm = tsk->mm;
-	struct task_struct *old_tsk;
+	struct task_struct *old_tsk = mm->oom_mm.victim;
 
 	WARN_ON(oom_killer_disabled);
-	/* OOM killer might race with memcg OOM */
-	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+
 	/*
 	 * Since mark_oom_victim() is called from multiple threads,
 	 * connect this mm to oom_mm_list only if not yet connected.
-	 *
-	 * Since mark_oom_victim() is called with a stable mm (i.e.
-	 * mm->mm_users > 0), __exit_oom_mm() from __mmput() can't be called
-	 * before we add this mm to the list.
 	 */
-	spin_lock(&oom_mm_lock);
-	old_tsk = mm->oom_mm.victim;
 	get_task_struct(tsk);
 	mm->oom_mm.victim = tsk;
 	if (!old_tsk) {
 		atomic_inc(&mm->mm_count);
 		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
-	}
-	spin_unlock(&oom_mm_lock);
-	if (old_tsk)
+	} else
 		put_task_struct(old_tsk);
+
+	/* OOM killer might race with memcg OOM */
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -820,7 +793,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
-		mark_oom_victim(p, oc);
+		mark_oom_victim(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -880,7 +853,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim, oc);
+	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -994,7 +967,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current, oc);
+		mark_oom_victim(current);
 		return true;
 	}
 
-- 
1.8.3.1


* Re: [PATCH 0/8] Change OOM killer to use list of mm_struct.
  2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
                   ` (7 preceding siblings ...)
  2016-07-03  2:41 ` [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
@ 2016-07-07 11:04 ` Michal Hocko
  8 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 11:04 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Sun 03-07-16 11:35:56, Tetsuo Handa wrote:
> This is my alternative proposal compared to what Michal posted at
> http://lkml.kernel.org/r/1467365190-24640-1-git-send-email-mhocko@kernel.org .
> 
> The series is based on top of linux-next-20160701 +
> http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org .
> 
> The key point of the series is [PATCH 3/8].

I have only checked the diff between the whole patchset applied and what
I have posted as an RFC last week, so I cannot comment on specific
patches. Let me summarize the differences between the two approaches,
though.

My proposal adds a stable reference to the killed task's mm struct into the
signal struct, and most oom decisions can refer to this mm and its flags
because the signal struct's lifetime exceeds that of its visible task_struct.
We still need the signal->oom_victims counter to track the individual threads'
lifetimes.
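
For reference, the core of my approach is roughly the following (heavily
simplified, see the RFC for the real thing):

	struct signal_struct {
		/* ... */
		struct mm_struct *oom_mm;	/* mm of the oom victim */
	};

	/* in mark_oom_victim(), simplified: */
	if (!tsk->signal->oom_mm) {
		atomic_inc(&mm->mm_count);	/* pin the mm_struct past exit_mm()/__mmput() */
		tsk->signal->oom_mm = mm;
	}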

Yours enqueues the mm on a linked list and has a similar effect, with
the advantage that signal->oom_victims is no longer needed because you
have pulled the OOM_SCAN_ABORT logic out of select_bad_process into the
earlier {mem_cgroup_}out_of_memory and check for the existence of a
compatible mm for the oom domain. This means that the mm struct has to
remember all the information that might be gone by the time we look at
the enqueued mm again, which means a slightly larger memory footprint
(nothing earth shattering though).

That being said, I believe both approaches are sound. So let's discuss
{ad,dis}vantages of those approaches.

You are introducing more code, but to be fair I guess queuing the mm rather
than the task is better long term. Copying the state for later use is
unfortunate, but it might turn out better to have all the oom-specific
stuff inside the mm rather than spread around in other structures.

As I've said, I haven't looked very deeply into the details, but at least
the memcg handling would need more work; I will respond to the specific
patch.

I guess the mm visibility is basically the same with both approaches. Even
though you hide the mm from __mmput while mine keeps it alive until the signal
struct goes away, the two are basically equivalent because mine hides the mm
from the oom reaper with MMF_OOM_REAPED, and the oom_reaper is just a weaker
form of __mmput.

I am not tied to my approach, but could you name the main arguments why you
think yours is better?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check.
  2016-07-03 21:53         ` Tetsuo Handa
  2016-07-04 11:13           ` Oleg Nesterov
@ 2016-07-07 11:14           ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 11:14 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: oleg, linux-mm, akpm, rientjes, vdavydov, mst

On Mon 04-07-16 06:53:49, Tetsuo Handa wrote:
[...]
> >From 977b0f4368a7ca07af7e519aa8795e7b2ee653d0 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 4 Jul 2016 06:40:05 +0900
> Subject: [PATCH 1/8] mm,oom_reaper: Don't boot without OOM reaper kernel thread.
> 
> We are trying to prove that OOM livelock is impossible for CONFIG_MMU=y
> kernels (as long as the OOM killer is invoked) because the OOM reaper
> always gives feedback to the OOM killer. Therefore, preserving code which
> continues without the OOM reaper no longer makes sense.
> 
> Since oom_init() is called before OOM-killable userspace processes are
> started, the system will panic if out_of_memory() is called before
> oom_init() returns. Therefore, oom_reaper_th == NULL check in
> wake_oom_reaper() is pointless.
> 
> If kthread_run() in oom_init() fails due to reasons other than
> out_of_memory(), userspace processes won't be able to start as well.
> Therefore, trying to continue with error message is also pointless.
> But in case something unexpected occurred, let's explicitly add
> BUG_ON() check.

I have already said this earlier. The oom_reaper is not crucial for
standard system operation. Panicking on this failure seems like an
overreaction. It is true that the system might panic right afterwards
for other reasons, but that is not a reason to panic here.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

I am not going to ack or nak this patch because it will have no
practical effect.

> ---
>  mm/oom_kill.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7d0a275..079ce96 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -447,7 +447,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
>   * OOM Reaper kernel thread which tries to reap the memory used by the OOM
>   * victim (if that is possible) to help the OOM killer to move on.
>   */
> -static struct task_struct *oom_reaper_th;
>  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
> @@ -629,9 +628,6 @@ static int oom_reaper(void *unused)
>  
>  void wake_oom_reaper(struct task_struct *tsk)
>  {
> -	if (!oom_reaper_th)
> -		return;
> -
>  	/* tsk is already queued? */
>  	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
>  		return;
> @@ -647,12 +643,9 @@ void wake_oom_reaper(struct task_struct *tsk)
>  
>  static int __init oom_init(void)
>  {
> -	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> -	if (IS_ERR(oom_reaper_th)) {
> -		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> -				PTR_ERR(oom_reaper_th));
> -		oom_reaper_th = NULL;
> -	}
> +	struct task_struct *p = kthread_run(oom_reaper, NULL, "oom_reaper");
> +
> +	BUG_ON(IS_ERR(p));
>  	return 0;
>  }
>  subsys_initcall(oom_init)
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage.
  2016-07-03  2:37 ` [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
@ 2016-07-07 11:19   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 11:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Sun 03-07-16 11:37:33, Tetsuo Handa wrote:
> >From 3be379c6b42a0901cd81fb2c743e321b6fbdec5b Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 2 Jul 2016 22:55:17 +0900
> Subject: [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage.
> 
> Since holding an mm_struct with an elevated mm_count for a second is
> harmless, we can determine the mm_struct and hold it upon entry to
> oom_reap_task(). This patch has no functional change. A future patch in
> this series will eliminate find_lock_task_mm() usage from the OOM
> reaper.

OK, this seems to reduce the code size and makes the code more readable.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/oom_kill.c | 79 ++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 37 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 16340f2..76c765e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -451,12 +451,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -static bool __oom_reap_task(struct task_struct *tsk)
> +static bool __oom_reap_task(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
> -	struct mm_struct *mm = NULL;
> -	struct task_struct *p;
>  	struct zap_details details = {.check_swap_entries = true,
>  				      .ignore_dirty = true};
>  	bool ret = true;
> @@ -477,22 +475,9 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  	 */
>  	mutex_lock(&oom_lock);
>  
> -	/*
> -	 * Make sure we find the associated mm_struct even when the particular
> -	 * thread has already terminated and cleared its mm.
> -	 * We might have race with exit path so consider our work done if there
> -	 * is no mm.
> -	 */
> -	p = find_lock_task_mm(tsk);
> -	if (!p)
> -		goto unlock_oom;
> -	mm = p->mm;
> -	atomic_inc(&mm->mm_count);
> -	task_unlock(p);
> -
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
> -		goto mm_drop;
> +		goto unlock_oom;
>  	}
>  
>  	/*
> @@ -502,7 +487,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  	 */
>  	if (!mmget_not_zero(mm)) {
>  		up_read(&mm->mmap_sem);
> -		goto mm_drop;
> +		goto unlock_oom;
>  	}
>  
>  	tlb_gather_mmu(&tlb, mm, 0, -1);
> @@ -550,8 +535,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  	 * put the oom_reaper out of the way.
>  	 */
>  	mmput_async(mm);
> -mm_drop:
> -	mmdrop(mm);
>  unlock_oom:
>  	mutex_unlock(&oom_lock);
>  	return ret;
> @@ -561,36 +544,45 @@ unlock_oom:
>  static void oom_reap_task(struct task_struct *tsk)
>  {
>  	int attempts = 0;
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *p = find_lock_task_mm(tsk);
> +
> +	/*
> +	 * Make sure we find the associated mm_struct even when the particular
> +	 * thread has already terminated and cleared its mm.
> +	 * We might have race with exit path so consider our work done if there
> +	 * is no mm.
> +	 */
> +	if (!p)
> +		goto done;
> +	mm = p->mm;
> +	atomic_inc(&mm->mm_count);
> +	task_unlock(p);
>  
>  	/* Retry the down_read_trylock(mmap_sem) a few times */
> -	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
> +	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
>  
> -	if (attempts > MAX_OOM_REAP_RETRIES) {
> -		struct task_struct *p;
> +	if (attempts <= MAX_OOM_REAP_RETRIES)
> +		goto done;
>  
> -		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -				task_pid_nr(tsk), tsk->comm);
> +	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> +		task_pid_nr(tsk), tsk->comm);
>  
> -		/*
> -		 * If we've already tried to reap this task in the past and
> -		 * failed it probably doesn't make much sense to try yet again
> -		 * so hide the mm from the oom killer so that it can move on
> -		 * to another task with a different mm struct.
> -		 */
> -		p = find_lock_task_mm(tsk);
> -		if (p) {
> -			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
> -				pr_info("oom_reaper: giving up pid:%d (%s)\n",
> -						task_pid_nr(tsk), tsk->comm);
> -				set_bit(MMF_OOM_REAPED, &p->mm->flags);
> -			}
> -			task_unlock(p);
> -		}
> -
> -		debug_show_all_locks();
> +	/*
> +	 * If we've already tried to reap this task in the past and
> +	 * failed it probably doesn't make much sense to try yet again
> +	 * so hide the mm from the oom killer so that it can move on
> +	 * to another task with a different mm struct.
> +	 */
> +	if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
> +		pr_info("oom_reaper: giving up pid:%d (%s)\n",
> +			task_pid_nr(tsk), tsk->comm);
> +		set_bit(MMF_OOM_REAPED, &mm->flags);
>  	}
> +	debug_show_all_locks();
>  
> +done:
>  	/*
>  	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
>  	 * reasonably reclaimable memory anymore or it is not a good candidate
> @@ -602,6 +594,9 @@ static void oom_reap_task(struct task_struct *tsk)
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);
> +	/* Drop a reference taken above. */
> +	if (mm)
> +		mmdrop(mm);
>  }
>  
>  static int oom_reaper(void *unused)
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims.
  2016-07-03  2:38 ` [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
  2016-07-04 10:39   ` Oleg Nesterov
@ 2016-07-07 11:31   ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 11:31 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

I've got lost in the follow-up patches. Could you post only a rework of
this particular one, please?

You have already noticed that the nodemask check is tricky, and the
memcg check would be tricky for a different reason.

On Sun 03-07-16 11:38:20, Tetsuo Handa wrote:
[...]
> +bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
> +{
> +	struct oom_mm *mm;
> +	bool ret = false;
> +
> +	spin_lock(&oom_mm_lock);
> +	list_for_each_entry(mm, &oom_mm_list, list) {
> +		if (memcg && mm->memcg != memcg)
> +			continue;
> +		if (nodemask && mm->nodemask != nodemask)
> +			continue;
> +		ret = true;
> +		break;
> +	}
> +	spin_unlock(&oom_mm_lock);
> +	return ret;
> +}
> +
>  enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  					struct task_struct *task)
>  {
[...]
> @@ -647,18 +668,37 @@ subsys_initcall(oom_init)
>  /**
>   * mark_oom_victim - mark the given task as OOM victim
>   * @tsk: task to mark
> + * @oc: oom_control
>   *
>   * Has to be called with oom_lock held and never after
>   * oom has been disabled already.
>   */
> -void mark_oom_victim(struct task_struct *tsk)
> +void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
>  {
> +	struct mm_struct *mm = tsk->mm;
> +
>  	WARN_ON(oom_killer_disabled);
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
>  	atomic_inc(&tsk->signal->oom_victims);
>  	/*
> +	 * Since mark_oom_victim() is called from multiple threads,
> +	 * connect this mm to oom_mm_list only if not yet connected.
> +	 *
> +	 * Since mark_oom_victim() is called with a stable mm (i.e.
> +	 * mm->mm_users > 0), exit_oom_mm() from __mmput() can't be called
> +	 * before we add this mm to the list.
> +	 */
> +	spin_lock(&oom_mm_lock);
> +	if (!mm->oom_mm.list.next) {
> +		atomic_inc(&mm->mm_count);
> +		mm->oom_mm.memcg = oc->memcg;
> +		mm->oom_mm.nodemask = oc->nodemask;
> +		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
> +	}

Here you are storing the memcg where the OOM killer happened, but later
on we might encounter an OOM at an upper level of the memcg hierarchy,
and then we want to hold off the oom killer if there is an mm which
hasn't yet released its memory at a lower level of the hierarchy. E.g.
          A
         / \
        B   C

C hits its limit and invokes the oom killer. Then we hit oom on A
because of a charge for B. Now for_each_mem_cgroup_tree would encounter
such a task/mm and abort the selection. With a pure memcg pointer check
we would skip that mm. This would be fixable by using
mem_cgroup_is_descendant, but that requires an alive memcg's css, so you
have to pin it.
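
Something along these lines would be one way to do it (only a sketch:
it assumes CONFIG_MEMCG, and that the memcg recorded from
mark_oom_victim() has been pinned with css_get() and is released again
when the mm leaves the list):

bool oom_has_pending_mm(struct mem_cgroup *memcg, const nodemask_t *nodemask)
{
	struct oom_mm *mm;
	bool ret = false;

	spin_lock(&oom_mm_lock);
	list_for_each_entry(mm, &oom_mm_list, list) {
		/*
		 * An mm recorded for an OOM below @memcg must still block
		 * an OOM anywhere up the hierarchy.
		 */
		if (memcg && (!mm->memcg ||
			      !mem_cgroup_is_descendant(mm->memcg, memcg)))
			continue;
		if (nodemask && mm->nodemask != nodemask)
			continue;
		ret = true;
		break;
	}
	spin_unlock(&oom_mm_lock);
	return ret;
}
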
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims.
  2016-07-03  2:40 ` [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims Tetsuo Handa
@ 2016-07-07 14:03   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 14:03 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Sun 03-07-16 11:40:03, Tetsuo Handa wrote:
> >From e2e22cd0c0c486d1aef01232fcc62b00fb01709f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 2 Jul 2016 23:02:19 +0900
> Subject: [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims.
> 
> Since the OOM_SCAN_ABORT case was removed, signal_struct->oom_victims
> is no longer used. Remove it.

I would squash this one into the previous patch.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/sched.h | 1 -
>  mm/oom_kill.c         | 2 --
>  2 files changed, 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 553af29..f472f27 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -671,7 +671,6 @@ struct signal_struct {
>  	atomic_t		sigcnt;
>  	atomic_t		live;
>  	int			nr_threads;
> -	atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */
>  	struct list_head	thread_head;
>  
>  	wait_queue_head_t	wait_chldexit;	/* for wait4() */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 734378a..55e0ffb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -659,7 +659,6 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> -	atomic_inc(&tsk->signal->oom_victims);
>  	/*
>  	 * Since mark_oom_victim() is called from multiple threads,
>  	 * connect this mm to oom_mm_list only if not yet connected.
> @@ -693,7 +692,6 @@ void exit_oom_victim(struct task_struct *tsk)
>  {
>  	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> -	atomic_dec(&tsk->signal->oom_victims);
>  
>  	if (!atomic_dec_return(&oom_victims))
>  		wake_up_all(&oom_victims_wait);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
  2016-07-03  2:40 ` [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
@ 2016-07-07 14:06   ` Michal Hocko
  2016-07-07 16:10     ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 14:06 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Sun 03-07-16 11:40:41, Tetsuo Handa wrote:
> >From 00b7a14653c9700429f89e4512f6000a39cce59d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 2 Jul 2016 23:03:03 +0900
> Subject: [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
> 
> Since oom_has_pending_mm() controls whether to select next OOM victim,
> we no longer need to clear TIF_MEMDIE on remote thread. Therefore,
> revert related changes in commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE
> after oom_reaper managed to unmap the address space") and
> commit e26796066fdf929c ("oom: make oom_reaper freezable") and
> commit 74070542099c66d8 ("oom, suspend: fix oom_reaper vs.
> oom_killer_disable race").

The last revert is not safe. See
http://lkml.kernel.org/r/1467365190-24640-3-git-send-email-mhocko@kernel.org

> This patch temporarily breaks what commit 36324a990cf578b5 ("oom: clear
> TIF_MEMDIE after oom_reaper managed to unmap the address space") and
> commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
> queued for oom_reaper") try to solve due to oom_has_pending_mm().
> Future patch in this series will fix it.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/oom.h    |  2 +-
>  kernel/exit.c          |  2 +-
>  kernel/power/process.c | 12 ------------
>  mm/oom_kill.c          | 15 ++-------------
>  4 files changed, 4 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 5991d41..4844325 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -98,7 +98,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  
>  extern bool out_of_memory(struct oom_control *oc);
>  
> -extern void exit_oom_victim(struct task_struct *tsk);
> +extern void exit_oom_victim(void);
>  
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 84ae830..1b1dada 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -511,7 +511,7 @@ static void exit_mm(struct task_struct *tsk)
>  	mm_update_next_owner(mm);
>  	mmput(mm);
>  	if (test_thread_flag(TIF_MEMDIE))
> -		exit_oom_victim(tsk);
> +		exit_oom_victim();
>  }
>  
>  static struct task_struct *find_alive_thread(struct task_struct *p)
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 0c2ee97..df058be 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -146,18 +146,6 @@ int freeze_processes(void)
>  	if (!error && !oom_killer_disable())
>  		error = -EBUSY;
>  
> -	/*
> -	 * There is a hard to fix race between oom_reaper kernel thread
> -	 * and oom_killer_disable. oom_reaper calls exit_oom_victim
> -	 * before the victim reaches exit_mm so try to freeze all the tasks
> -	 * again and catch such a left over task.
> -	 */
> -	if (!error) {
> -		pr_info("Double checking all user space processes after OOM killer disable... ");
> -		error = try_to_freeze_tasks(true);
> -		pr_cont("\n");
> -	}
> -
>  	if (error)
>  		thaw_processes();
>  	return error;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 55e0ffb..45e7de2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -582,15 +582,7 @@ static void oom_reap_task(struct task_struct *tsk)
>  	debug_show_all_locks();
>  
>  done:
> -	/*
> -	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> -	 * reasonably reclaimable memory anymore or it is not a good candidate
> -	 * for the oom victim right now because it cannot release its memory
> -	 * itself nor by the oom reaper.
> -	 */
>  	tsk->oom_reaper_list = NULL;
> -	exit_oom_victim(tsk);
> -
>  	/* Drop a reference taken by wake_oom_reaper */
>  	put_task_struct(tsk);
>  	/* Drop a reference taken above. */
> @@ -600,8 +592,6 @@ done:
>  
>  static int oom_reaper(void *unused)
>  {
> -	set_freezable();
> -
>  	while (true) {
>  		struct task_struct *tsk = NULL;
>  
> @@ -688,10 +678,9 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
>  /**
>   * exit_oom_victim - note the exit of an OOM victim
>   */
> -void exit_oom_victim(struct task_struct *tsk)
> +void exit_oom_victim(void)
>  {
> -	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> -		return;
> +	clear_thread_flag(TIF_MEMDIE);
>  
>  	if (!atomic_dec_return(&oom_victims))
>  		wake_up_all(&oom_victims_wait);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct.
  2016-07-03  2:41 ` [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
  2016-07-04 10:40   ` Oleg Nesterov
@ 2016-07-07 14:15   ` Michal Hocko
  1 sibling, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 14:15 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Sun 03-07-16 11:41:59, Tetsuo Handa wrote:
> >From ce97a7f6b94a6003bc2b41e5f69da2fedc934a9d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 2 Jul 2016 23:10:06 +0900
> Subject: [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct.
> 
> Since OOM reaping is per mm_struct operation, it is natural to use
> list of mm_struct used by OOM victims.

I would expect this to be an immediate follow-up to "[PATCH 3/8]
mm,oom: Use list of mm_struct used by OOM victims."

> This patch eliminates find_lock_task_mm() usage from the OOM reaper.

It would be good to explain what the consequences are.

> This patch fixes what commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE
> after oom_reaper managed to unmap the address space") and
> commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
> queued for oom_reaper") tried to address, by always calling exit_oom_mm()
> after OOM reap attempt was made.

And rather than mentioning the above two commits, it would be better to
explain what the difference is wrt. the current implementation.

Other than that it looks good to me.

> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/oom.h   |  8 -----
>  include/linux/sched.h |  3 --
>  mm/memcontrol.c       |  1 -
>  mm/oom_kill.c         | 84 +++++++++++++++++----------------------------------
>  4 files changed, 27 insertions(+), 69 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 4844325..1a212c1 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -71,14 +71,6 @@ static inline bool oom_task_origin(const struct task_struct *p)
>  
>  extern void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc);
>  
> -#ifdef CONFIG_MMU
> -extern void wake_oom_reaper(struct task_struct *tsk);
> -#else
> -static inline void wake_oom_reaper(struct task_struct *tsk)
> -{
> -}
> -#endif
> -
>  extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
>  		unsigned long totalpages);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f472f27..4379279 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1916,9 +1916,6 @@ struct task_struct {
>  	unsigned long	task_state_change;
>  #endif
>  	int pagefault_disabled;
> -#ifdef CONFIG_MMU
> -	struct task_struct *oom_reaper_list;
> -#endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
>  /*
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c17160d..9acc840 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1236,7 +1236,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	if (task_will_free_mem(current)) {
>  		mark_oom_victim(current, &oc);
> -		wake_oom_reaper(current);
>  		goto unlock;
>  	}
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 317ce2c..bdc192f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -468,8 +468,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
>   * victim (if that is possible) to help the OOM killer to move on.
>   */
>  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> -static struct task_struct *oom_reaper_list;
> -static DEFINE_SPINLOCK(oom_reaper_lock);
>  
>  static bool __oom_reap_vmas(struct mm_struct *mm)
>  {
> @@ -540,30 +538,27 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
>  }
>  
>  #define MAX_OOM_REAP_RETRIES 10
> -static void oom_reap_task(struct task_struct *tsk)
> +static void oom_reap_vmas(struct mm_struct *mm)
>  {
>  	int attempts = 0;
> -	struct mm_struct *mm = NULL;
> -	struct task_struct *p = find_lock_task_mm(tsk);
> +	bool ret;
>  
>  	/*
> -	 * Make sure we find the associated mm_struct even when the particular
> -	 * thread has already terminated and cleared its mm.
> -	 * We might have race with exit path so consider our work done if there
> -	 * is no mm.
> +	 * Check MMF_OOM_REAPED after holding oom_lock because
> +	 * oom_kill_process() might find this mm pinned.
>  	 */
> -	if (!p)
> -		goto done;
> -	mm = p->mm;
> -	atomic_inc(&mm->mm_count);
> -	task_unlock(p);
> +	mutex_lock(&oom_lock);
> +	ret = test_bit(MMF_OOM_REAPED, &mm->flags);
> +	mutex_unlock(&oom_lock);
> +	if (ret)
> +		return;
>  
>  	/* Retry the down_read_trylock(mmap_sem) a few times */
>  	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_vmas(mm))
>  		schedule_timeout_idle(HZ/10);
>  
>  	if (attempts <= MAX_OOM_REAP_RETRIES)
> -		goto done;
> +		return;
>  
>  	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  		mm->oom_mm.pid, mm->oom_mm.comm);
> @@ -580,51 +575,30 @@ static void oom_reap_task(struct task_struct *tsk)
>  		set_bit(MMF_OOM_REAPED, &mm->flags);
>  	}
>  	debug_show_all_locks();
> -
> -done:
> -	tsk->oom_reaper_list = NULL;
> -	/* Drop a reference taken by wake_oom_reaper */
> -	put_task_struct(tsk);
> -	/* Drop a reference taken above. */
> -	if (mm)
> -		mmdrop(mm);
>  }
>  
>  static int oom_reaper(void *unused)
>  {
>  	while (true) {
> -		struct task_struct *tsk = NULL;
> -
> -		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
> -		spin_lock(&oom_reaper_lock);
> -		if (oom_reaper_list != NULL) {
> -			tsk = oom_reaper_list;
> -			oom_reaper_list = tsk->oom_reaper_list;
> -		}
> -		spin_unlock(&oom_reaper_lock);
> -
> -		if (tsk)
> -			oom_reap_task(tsk);
> +		struct mm_struct *mm = NULL;
> +
> +		wait_event_freezable(oom_reaper_wait,
> +				     !list_empty(&oom_mm_list));
> +		spin_lock(&oom_mm_lock);
> +		if (!list_empty(&oom_mm_list))
> +			mm = list_first_entry(&oom_mm_list, struct mm_struct,
> +					      oom_mm.list);
> +		spin_unlock(&oom_mm_lock);
> +		if (!mm)
> +			continue;
> +		oom_reap_vmas(mm);
> +		/* Drop a reference taken by mark_oom_victim(). */
> +		exit_oom_mm(mm);
>  	}
>  
>  	return 0;
>  }
>  
> -void wake_oom_reaper(struct task_struct *tsk)
> -{
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> -		return;
> -
> -	get_task_struct(tsk);
> -
> -	spin_lock(&oom_reaper_lock);
> -	tsk->oom_reaper_list = oom_reaper_list;
> -	oom_reaper_list = tsk;
> -	spin_unlock(&oom_reaper_lock);
> -	wake_up(&oom_reaper_wait);
> -}
> -
>  static int __init oom_init(void)
>  {
>  	kthread_run(oom_reaper, NULL, "oom_reaper");
> @@ -667,6 +641,9 @@ void mark_oom_victim(struct task_struct *tsk, struct oom_control *oc)
>  		mm->oom_mm.pid = task_pid_nr(tsk);
>  #endif
>  		list_add_tail(&mm->oom_mm.list, &oom_mm_list);
> +#ifdef CONFIG_MMU
> +		wake_up(&oom_reaper_wait);
> +#endif
>  	}
>  	spin_unlock(&oom_mm_lock);
>  	/*
> @@ -816,7 +793,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> -	bool can_oom_reap = true;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -825,7 +801,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	task_lock(p);
>  	if (task_will_free_mem(p)) {
>  		mark_oom_victim(p, oc);
> -		wake_oom_reaper(p);
>  		task_unlock(p);
>  		put_task_struct(p);
>  		return;
> @@ -915,7 +890,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			 * memory might be still used. Hide the mm from the oom
>  			 * killer to guarantee OOM forward progress.
>  			 */
> -			can_oom_reap = false;
>  			set_bit(MMF_OOM_REAPED, &mm->flags);
>  			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
>  					task_pid_nr(victim), victim->comm,
> @@ -926,9 +900,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	}
>  	rcu_read_unlock();
>  
> -	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> -
>  	mmdrop(mm);
>  	put_task_struct(victim);
>  }
> @@ -1004,7 +975,6 @@ bool out_of_memory(struct oom_control *oc)
>  	 */
>  	if (task_will_free_mem(current)) {
>  		mark_oom_victim(current, oc);
> -		wake_oom_reaper(current);
>  		return true;
>  	}
>  
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
  2016-07-07 14:06   ` Michal Hocko
@ 2016-07-07 16:10     ` Tetsuo Handa
  2016-07-07 16:54       ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2016-07-07 16:10 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

Michal Hocko wrote:
> On Sun 03-07-16 11:40:41, Tetsuo Handa wrote:
> > >From 00b7a14653c9700429f89e4512f6000a39cce59d Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Sat, 2 Jul 2016 23:03:03 +0900
> > Subject: [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
> > 
> > Since oom_has_pending_mm() controls whether to select next OOM victim,
> > we no longer need to clear TIF_MEMDIE on remote thread. Therefore,
> > revert related changes in commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE
> > after oom_reaper managed to unmap the address space") and
> > commit e26796066fdf929c ("oom: make oom_reaper freezable") and
> > commit 74070542099c66d8 ("oom, suspend: fix oom_reaper vs.
> > oom_killer_disable race").
> 
> The last revert is not safe. See
> http://lkml.kernel.org/r/1467365190-24640-3-git-send-email-mhocko@kernel.org

Why?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
  2016-07-07 16:10     ` Tetsuo Handa
@ 2016-07-07 16:54       ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2016-07-07 16:54 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, oleg, rientjes, vdavydov, mst

On Fri 08-07-16 01:10:02, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 03-07-16 11:40:41, Tetsuo Handa wrote:
> > > >From 00b7a14653c9700429f89e4512f6000a39cce59d Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Sat, 2 Jul 2016 23:03:03 +0900
> > > Subject: [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread.
> > > 
> > > Since oom_has_pending_mm() controls whether to select next OOM victim,
> > > we no longer need to clear TIF_MEMDIE on remote thread. Therefore,
> > > revert related changes in commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE
> > > after oom_reaper managed to unmap the address space") and
> > > commit e26796066fdf929c ("oom: make oom_reaper freezable") and
> > > commit 74070542099c66d8 ("oom, suspend: fix oom_reaper vs.
> > > oom_killer_disable race").
> > 
> > The last revert is not safe. See
> > http://lkml.kernel.org/r/1467365190-24640-3-git-send-email-mhocko@kernel.org
> 
> Why?

Because there is no guarantee that exit_oom_victim will be called, and
thus no guarantee of oom_killer_disable() making forward progress. The
changelog referenced above tries to explain that.
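
To illustrate the dependency (just a sketch, not the exact mainline
code): oom_killer_disable() can only return once oom_victims drops to
zero, and the only thing that drops it is exit_oom_victim():

bool oom_killer_disable(void)
{
	/* sketch only: the real function also synchronizes via oom_lock */
	oom_killer_disabled = true;

	/*
	 * If nothing guarantees that every victim eventually calls
	 * exit_oom_victim() (which decrements oom_victims and wakes
	 * oom_victims_wait), this wait - and freeze_processes() -
	 * may never finish.
	 */
	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
	return true;
}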

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-07-07 16:54 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-03  2:35 [PATCH 0/8] Change OOM killer to use list of mm_struct Tetsuo Handa
2016-07-03  2:36 ` [PATCH 1/8] mm,oom_reaper: Remove pointless kthread_run() failure check Tetsuo Handa
2016-07-03 12:42   ` Oleg Nesterov
2016-07-03 16:03     ` Tetsuo Handa
2016-07-03 17:10       ` Oleg Nesterov
2016-07-03 21:53         ` Tetsuo Handa
2016-07-04 11:13           ` Oleg Nesterov
2016-07-07 11:14           ` Michal Hocko
2016-07-03  2:37 ` [PATCH 2/8] mm,oom_reaper: Reduce find_lock_task_mm() usage Tetsuo Handa
2016-07-07 11:19   ` Michal Hocko
2016-07-03  2:38 ` [PATCH 3/8] mm,oom: Use list of mm_struct used by OOM victims Tetsuo Handa
2016-07-04 10:39   ` Oleg Nesterov
2016-07-04 12:50     ` Tetsuo Handa
2016-07-04 18:25       ` Oleg Nesterov
2016-07-05 10:43         ` Tetsuo Handa
2016-07-05 20:52           ` Oleg Nesterov
2016-07-06  8:53             ` Oleg Nesterov
2016-07-06 11:43               ` Tetsuo Handa
2016-07-07 11:31   ` Michal Hocko
2016-07-03  2:39 ` [PATCH 4/8] mm,oom: Remove OOM_SCAN_ABORT case Tetsuo Handa
2016-07-03  2:40 ` [PATCH 5/8] mm,oom: Remove unused signal_struct->oom_victims Tetsuo Handa
2016-07-07 14:03   ` Michal Hocko
2016-07-03  2:40 ` [PATCH 6/8] mm,oom_reaper: Stop clearing TIF_MEMDIE on remote thread Tetsuo Handa
2016-07-07 14:06   ` Michal Hocko
2016-07-07 16:10     ` Tetsuo Handa
2016-07-07 16:54       ` Michal Hocko
2016-07-03  2:41 ` [PATCH 7/8] mm,oom_reaper: Pass OOM victim's comm and pid values via mm_struct Tetsuo Handa
2016-07-03  2:41 ` [PATCH 8/8] mm,oom_reaper: Make OOM reaper use list of mm_struct Tetsuo Handa
2016-07-04 10:40   ` Oleg Nesterov
2016-07-07 14:15   ` Michal Hocko
2016-07-07 11:04 ` [PATCH 0/8] Change OOM killer to " Michal Hocko
